Re: Self join

Marco Gaido Wed, 12 Dec 2018 02:06:08 -0800

Thank you all for your answers.

@Ryan Blue <rb...@netflix.com> sure, let me state the problem more clearly:
imagine you have 2 DataFrames with a common lineage (for instance, one is
derived from the other by a filter or any other transformation), and you
want to join them. Currently, there is a fix by Reynold which deduplicates
the join condition when the condition is an equality (please notice that in
this case it doesn't matter which attribute is on the left and which one is
on the right). But if the condition involves other comparisons, such as ">"
or "<", the result is an analysis error, because the attributes on both
sides are the same (e.g. you have the same id#3 attribute on both sides),
and you cannot deduplicate them blindly, since it matters which attribute is
on which side.
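
To make the ambiguity concrete, here is a minimal sketch in plain Scala. It
is a toy model, not Spark's analyzer: `Attr`, `Frame` and `filtered` are
hypothetical names invented for illustration, and the id 3 only mirrors the
id#3 example above. It shows nothing more than this: a frame derived by
filtering exposes the same attribute as its parent, so the two operands of
the comparison cannot be told apart.

```scala
// Toy model only: Attr and Frame are hypothetical stand-ins, not Spark classes.
final case class Attr(name: String, exprId: Long)

final case class Frame(output: Seq[Attr]) {
  // Like df("a"): look the column up by name in this frame's output.
  def apply(name: String): Attr = output.find(_.name == name).get

  // A filter keeps the same output attributes, so ids are shared with the parent.
  def filtered: Frame = Frame(output)
}

object SharedLineageDemo {
  val df1: Frame = Frame(Seq(Attr("a", 3L)))
  val df2: Frame = df1.filtered

  val left: Attr = df1("a")
  val right: Attr = df2("a")

  // Both references are the same attribute, so "left > right" carries no
  // information about which join side each operand belongs to.
  val indistinguishable: Boolean = left == right

  def main(args: Array[String]): Unit =
    println(s"left=$left right=$right indistinguishable=$indistinguishable")
}
```

With an equality condition the order of the operands is irrelevant, which is
why deduplication is safe there; with ">" or "<" it is not.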


@Reynold Xin <r...@databricks.com> my proposal was to add a dataset id to
the metadata of each attribute, so that in this case we can tell which
DataFrame the attribute comes from. I.e., given the DataFrames `df1` and
`df2`, where `df2` is derived from `df1`,
`df1.join(df2, df1("a") > df2("a"))` could be resolved because we would
know that the first attribute is taken from `df1`, so it has to be resolved
against `df1`, and likewise the second one against `df2`. But I am open to
any approach to this problem, if other people have better
ideas/suggestions.
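
A rough sketch of the idea, again as a toy model in plain Scala rather than
actual Spark code. Everything here is an assumption made for illustration:
the `datasetId` field, the `Frame` ids 1 and 2, and the `resolve` helper are
hypothetical and do not correspond to any existing Spark API or to the code
in the PR.

```scala
// Toy model only: every name here is hypothetical, not Spark's API.
final case class Attr(name: String, exprId: Long, datasetId: Option[Long] = None)

final case class Frame(id: Long, output: Seq[Attr]) {
  // Like df("a"), but the returned reference remembers which frame produced it.
  def apply(name: String): Attr =
    output.find(_.name == name).get.copy(datasetId = Some(id))

  // A derived frame shares the attributes but gets its own dataset id.
  def filtered(newId: Long): Frame = Frame(newId, output)
}

sealed trait Side
case object LeftSide extends Side
case object RightSide extends Side

object DatasetIdDemo {
  // Decide which join side a reference belongs to by looking at its dataset id tag.
  def resolve(ref: Attr, left: Frame, right: Frame): Option[Side] =
    ref.datasetId match {
      case Some(id) if id == left.id  => Some(LeftSide)
      case Some(id) if id == right.id => Some(RightSide)
      case _                          => None
    }

  val df1: Frame = Frame(1L, Seq(Attr("a", 3L)))
  val df2: Frame = df1.filtered(2L)

  def main(args: Array[String]): Unit = {
    println(resolve(df1("a"), df1, df2))
    println(resolve(df2("a"), df1, df2))
  }
}
```

The two references still share the same attribute id, but the tag lets the
resolver bind each one to the side it was taken from, so the direction of the
comparison is preserved.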

Thanks,
Marco

On Tue, 11 Dec 2018 at 18:31, Jörn Franke <jornfra...@gmail.com>
wrote:

> I don’t know your exact underlying business problem, but maybe a graph
> solution, such as Spark GraphX, would better meet your requirements.
> Usually self-joins are done to address some kind of graph problem (even if
> you would not describe it as such), and a graph solution is much more
> efficient for this kind of problem.
>
> On 11.12.2018 at 12:44, Marco Gaido <marcogaid...@gmail.com> wrote:
>
> Hi all,
>
> I'd like to bring to the attention of more people a problem which has
> been there for a long time, i.e. self joins. Currently, we have a lot of
> trouble with them. This has been reported several times to the community
> and seems to affect many people, but as of now no solution has been
> accepted for it.
>
> I created a PR some time ago in order to address the problem (
> https://github.com/apache/spark/pull/21449), but Wenchen mentioned that he
> tried to fix this problem too, and so far no attempt has been successful
> because there are no clear semantics (
> https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>
> So I'd like to propose that we discuss here the best approach for
> tackling this issue, which I think would be great to fix for 3.0.0: that
> way, if we decide to introduce breaking changes in the design, we can do so.
>
> Thoughts on this?
>
> Thanks,
> Marco
>
>
