Ok thanks Fabian. I'd like just to know the internals of the union of multiple datasets (partitioning, distribution among server, memory/disk, etc..). Do you have any ref to this?
Thanks in advance, Flavio On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <[email protected]> wrote: > Follow the first approach. > Joins are expensive, union comes for free. > > Best, Fabian > > 2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <[email protected]>: > >> Hi guys, >> >> In my use case I have multiple Datasets with the same structure (e.g. >> Tuple3) and I want to produce an output Dataset containing all Tuple3 >> grouped by the first field (0). >> I can obtain the same results performing a union of all datasets and then >> a group by (simplest implementation) or join all of them pairwise >> (((A->B)->C)->D)..) or I don't know if there is any other solution. When >> should I use the first or the second approach? Could you help me in >> figuring out the internals of the two approaches? I always have some fear >> when using multiple joins when I don't know exactly their size.. >> >> Best, >> Flavio >> > >
