Re: Left join with unbalanced dataset

2016-01-30 Thread Chiwan Park
Hi Arnaud,

To join two datasets, the community recommends using the join operation rather 
than the cogroup operation. For a left join, you can use the leftOuterJoin 
method. Flink’s optimizer chooses the distributed join execution strategy based 
on statistics about the datasets, such as their size. Additionally, you can set 
a join hint to help the optimizer decide on the strategy.
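
As a rough illustration of what a left outer join returns, here is a minimal 
sketch in plain Java. It is not Flink code, and the class and method names are 
made up for this example. Every record of A appears once in the output, paired 
with its B record if one exists and with null otherwise. In the DataSet API the 
corresponding call has roughly the shape 
a.leftOuterJoin(b, JoinHint.REPARTITION_SORT_MERGE).where(...).equalTo(...).with(...), 
where the choice of hint is only an assumption on my side; please check [1] for 
the hints that outer joins support in your version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: it only shows left outer join semantics.
final class LeftOuterJoinSketch {

    // Each record is a {key, value} pair. B is assumed to hold at most one
    // record per key, as described in the question.
    static List<String[]> leftOuterJoin(List<String[]> a, List<String[]> b) {
        // Build a lookup table on the smaller side (B).
        Map<String, String> bByKey = new HashMap<>();
        for (String[] rec : b) {
            bByKey.put(rec[0], rec[1]);
        }
        // Probe with every record of A; unmatched records get null as the B value.
        List<String[]> out = new ArrayList<>();
        for (String[] rec : a) {
            out.add(new String[] {rec[0], rec[1], bByKey.get(rec[0])});
        }
        return out;
    }
}
```

The function you pass to a Flink outer join has to handle the same null case 
for the side that has no matching record.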

You can find details about the outer join operation in the transformations 
section [1] of the Flink documentation.

I hope this helps.

[1]: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations

Regards,
Chiwan Park

> On Jan 30, 2016, at 6:43 PM, LINZ, Arnaud wrote:
> 
> Hello,
> 
> I have a very big dataset A to left join with a dataset B that is half its 
> size. That is to say, half of A's records will be matched with one record of 
> B, and the other half with null values.
> 
> I used a CoGroup for that, but my batch fails because YARN kills the 
> container due to memory problems.
> 
> I guess that’s because one worker will get half of dataset A (the unmatched 
> records), and that’s too much for a single JVM.
> 
> Is my diagnosis right? Is there a better way to left join unbalanced 
> datasets?
> 
> Best regards,
> 
> Arnaud



Left join with unbalanced dataset

2016-01-30 Thread LINZ, Arnaud
Hello,

I have a very big dataset A to left join with a dataset B that is half its 
size. That is to say, half of A's records will be matched with one record of B, 
and the other half with null values.

I used a CoGroup for that, but my batch fails because YARN kills the container 
due to memory problems.

I guess that’s because one worker will get half of dataset A (the unmatched 
records), and that’s too much for a single JVM.

Is my diagnosis right? Is there a better way to left join unbalanced 
datasets?

Best regards,

Arnaud




The integrity of this message cannot be guaranteed on the Internet. The company 
that sent this message cannot therefore be held liable for its content nor 
attachments. Any unauthorized use or dissemination is prohibited. If you are 
not the intended recipient of this message, then please delete it and notify 
the sender.