Union is just combining data from multiple sources into a single dataset. 

That’s it. No memory, no disk involved.


In you case you have


input1.union(input2).groupBy(1).reduce(…)


This will translate into:


input1 -> repartition -> 

                                        read-both-inputs ->  sort -> reduce

input2 -> repartition ->


So, in your case not even additional network transfer is involved, because both 
data sets would need to be partitioned for the reduce anyway.


Note, union in Flink has SQL union-all semantics, i.e., there is not removal of 
duplicates.


Cheers, Fabian 






From: Flavio Pompermaier
Sent: ‎Monday‎, ‎22‎. ‎December‎, ‎2014 ‎14‎:‎32
To: [email protected]





Ok thanks Fabian. I'd like just to know the internals of the union of multiple 
datasets (partitioning, distribution among server, memory/disk, etc..). Do you 
have any ref to this?



Thanks in advance,

Flavio



On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <[email protected]> wrote:


Follow the first approach. 
Joins are expensive, union comes for free.





Best, Fabian





2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <[email protected]>:


Hi guys,






In my use case I have multiple Datasets with the same structure (e.g. Tuple3) 
and I want to produce an output Dataset containing all Tuple3 grouped by the 
first field (0).

I can obtain the same results performing a union of all datasets and then a 
group by (simplest implementation) or join all of them pairwise 
(((A->B)->C)->D)..) or I don't know if there is any other solution. When should 
I use the first or the second approach? Could you help me in figuring out the 
internals of the two approaches? I always have some fear when using multiple 
joins when I don't know exactly their size..




Best,

Flavio

Reply via email to