Re: Will broadcast stream affect performance because of the absence of operator chaining?

Piotr Nowojski Tue, 06 Aug 2019 05:55:39 -0700

Hi,

No, I think you are right, I forgot about the broadcasting requirement.


Piotrek

> On 6 Aug 2019, at 13:11, 黄兆鹏 <paulhu...@easyops.cn> wrote:
> 
> Hi, Piotrek,
> I previously considered your first advice(use union record type), but I found 
> that the schema would be only sent to one subtask of the operator(for 
> example, operatorA), and other subtasks of the operator are not aware of it. 
> In this case is there anything I have missed? 
> 
> Thank you!
> 
> 
> 
> 
> 
>  
> ------------------ Original ------------------
> From:  "Piotr Nowojski"<pi...@ververica.com>;
> Date:  Tue, Aug 6, 2019 06:57 PM
> To:  "黄兆鹏"<paulhu...@easyops.cn>;
> Cc:  "user"<user@flink.apache.org>;
> Subject:  Re: Will broadcast stream affect performance because of the absence 
> of operator chaining?
>  
> Hi,
> 
> Have you measured the performance impact of braking the operator chain?
> 
> This is a current limitation of Flink chaining, that if an operator has two 
> inputs, it can be chained to something else (only one input operators are 
> chained together). There are plans for the future to address this issue.
> 
> As a workaround, besides what you have mentioned:
> - maybe your record type can be a union: type of Record or Schema (not Record 
> AND Schema), and upstream operators (operatorA) could just ignore/forward the 
> Schema. You wouldn’t need to send schema with every record.
> - another (ugly) solution, is to implement BroadcastStream input outside of 
> Flink, but then you might have issues with checkpointing/watermarking and it 
> just makes many things more complicated.
> 
> Piotrek
> 
>> On 6 Aug 2019, at 10:50, 黄兆鹏 <paulhu...@easyops.cn 
>> <mailto:paulhu...@easyops.cn>> wrote:
>> 
>> Hi Piotrek,
>> Thanks for your reply, my broadcast stream just listen to the changes of the 
>> schema, and it's very infrequent and very lightweight.
>> 
>> In fact there are two ways to solve my problem,
>> 
>> the first one is a broadcast stream that listen to the change of the schema, 
>> and broadcast to every operator that will handle the data, just as I posted 
>> originally.
>> DataStream: OperatorA  ->  OperatorB  -> OperatorC
>>                           ^                   ^                      ^
>>                           |                    |                        |
>>                                   BroadcastStream
>> 
>> the second approach is that I have an operator that will join my data and 
>> schema together and send to the downstream operators:
>>  DataStream: MergeSchemaOperator -> OperatorA  ->  OperatorB  -> OperatorC
>>                           ^                 
>>                           |                   
>>                 BroadcastStream           
>> 
>> 
>> The benefits of the first approach is that the flink job does not have to 
>> transfer the schema with the real data records among operators, because the 
>> schema will be broadcasted to each operator.
>> But the disadvantage of the first approache is that it breaks the operator 
>> chain, so operators may not be executed in the same slot and gain worse 
>> performance.
>> 
>> The second approach does not have the problem as the first one, but each 
>> message will carry its schema info among operators, it will cost about 2x 
>> for serialization and deserialization between operators.
>> 
>> Is there a better workaround that all the operators could notice the schema 
>> change and at the same time not breaking the operator chaining?
>> 
>> Thanks!
>> 
>>  
>>  
>> ------------------ Original ------------------
>> From:  "Piotr Nowojski"<pi...@ververica.com <mailto:pi...@ververica.com>>;
>> Date:  Tue, Aug 6, 2019 04:23 PM
>> To:  "黄兆鹏"<paulhu...@easyops.cn <mailto:paulhu...@easyops.cn>>;
>> Cc:  "user"<user@flink.apache.org <mailto:user@flink.apache.org>>;
>> Subject:  Re: Will broadcast stream affect performance because of the 
>> absence of operator chaining?
>>  
>> Hi,
>> 
>> Broadcasting will brake an operator chain. However my best guess is that 
>> Kafka source will be still a performance bottleneck in your job. Also 
>> Network exchanges add some measurable overhead only if your records are very 
>> lightweight and easy to process (for example if you are using RocksDB then 
>> you can just ignore network costs).
>> 
>> Either way, you can just try this out. Pre populate your Kafka topic with 
>> some significant number of messages, run both jobs, compare the throughput 
>> and decide based on those results wether this is ok for you or not.
>> 
>> Piotrek 
>> 
>> > On 6 Aug 2019, at 09:56, 黄兆鹏 <paulhu...@easyops.cn 
>> > <mailto:paulhu...@easyops.cn>> wrote:
>> > 
>> > Hi all, 
>> > My flink job has dynamic schema of data, so I want to consume a schema 
>> > kafka topic and try to broadcast to every operator so that each operator 
>> > could know what kind of data it is handling.
>> > 
>> > For example, the two streams just like this:
>> > OperatorA  ->  OperatorB  -> OperatorC
>> >       ^                   ^                      ^
>> >       |                    |                       |
>> >                BroadcastStream
>> > 
>> > If the broadcast stream does not exist, OperatorA, OperatorB, OperatorC 
>> > are chained together in one slot because they have the same parallelism so 
>> > that it can gain maximum performance.
>> > 
>> > And I was wondering that if the broadcast stream exists, will it affect 
>> > the performance? Or flink will still chain them together to gain maximum 
>> > performance? 
>> > 
>> > Thanks!
>

Re: Will broadcast stream affect performance because of the absence of operator chaining?

Reply via email to