Hi all,

This thread seems to have stalled. So, if there are no objections, I'll go ahead with a design doc describing the solution/workaround I mentioned earlier. Any concerns?

Thanks,
Marco
On Thu, Dec 13, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:

> Thanks for the extra context, Marco. I thought you were trying to propose
> a solution.
>
> On Thu, Dec 13, 2018 at 2:45 AM Marco Gaido <marcogaid...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> My goal with this email thread is to discuss with the community whether
>> there are better ideas (as I was told many other people have tried to
>> address this). I'd consider this a brainstorming thread. Once we have a
>> good proposal, we can go ahead with an SPIP.
>>
>> Thanks,
>> Marco
>>
>> On Wed, Dec 12, 2018 at 7:13 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Marco,
>>>
>>> I'm actually asking for a design doc that clearly states the problem
>>> and proposes a solution. This is a substantial change and should
>>> probably be an SPIP.
>>>
>>> I think that would be more likely to generate discussion than referring
>>> to PRs or a quick paragraph on the dev list, because the only people
>>> looking at it now are the ones already familiar with the problem.
>>>
>>> rb
>>>
>>> On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido <marcogaid...@gmail.com> wrote:
>>>
>>>> Thank you all for your answers.
>>>>
>>>> @Ryan Blue <rb...@netflix.com> sure, let me state the problem more
>>>> clearly: imagine you have two DataFrames with a common lineage (for
>>>> instance, one is derived from the other by some filtering, or anything
>>>> you prefer), and you want to join these two DataFrames. Currently,
>>>> there is a fix by Reynold which deduplicates the join condition when
>>>> the condition is an equality (note that in this case it doesn't matter
>>>> which attribute is on the left and which is on the right). But if the
>>>> condition involves other comparisons, such as ">" or "<", the query
>>>> fails with an analysis error, because the attributes on both sides are
>>>> the same (e.g. you have the same id#3 attribute on both sides), and
>>>> you cannot deduplicate them blindly, since which side each one belongs
>>>> to matters.
>>>>
>>>> @Reynold Xin <r...@databricks.com> my proposal was to add a dataset id
>>>> to the metadata of each attribute, so that in this case we can
>>>> distinguish which DataFrame an attribute comes from. I.e., given
>>>> DataFrames `df1` and `df2` where `df2` is derived from `df1`,
>>>> `df1.join(df2, df1("a") > df2("a"))` could be resolved because we
>>>> would know that the first attribute has to be resolved against `df1`
>>>> and the second against `df2`. But I am open to any approach to this
>>>> problem, if other people have better ideas or suggestions.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Tue, Dec 11, 2018 at 6:31 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> I don't know your exact underlying business problem, but maybe a
>>>>> graph solution such as Spark GraphX better meets your requirements.
>>>>> Self-joins are usually done to address some kind of graph problem
>>>>> (even if you would not describe it as such), and a graph engine is
>>>>> much more efficient for these kinds of problems.
>>>>>
>>>>> On Dec 11, 2018, at 12:44 PM, Marco Gaido <marcogaid...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to bring to the attention of more people a problem which has
>>>>> been there for a long time, i.e., self-joins. Currently, we have many
>>>>> troubles with them. This has been reported to the community several
>>>>> times and seems to affect many people, but as of now no solution has
>>>>> been accepted for it.
>>>>>
>>>>> I created a PR some time ago to address the problem
>>>>> (https://github.com/apache/spark/pull/21449), but Wenchen mentioned
>>>>> that he had tried to fix this problem too, and so far no attempt has
>>>>> been successful because there is no clear semantic
>>>>> (https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>>>>>
>>>>> So I'd like to propose that we discuss here which is the best
>>>>> approach for tackling this issue. I think it would be great to fix it
>>>>> for 3.0.0, so that if we decide to introduce breaking changes in the
>>>>> design, we can do that.
>>>>>
>>>>> Thoughts on this?
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
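[Editor's note: the self-join ambiguity discussed in this thread can be reproduced with a short snippet. This is a minimal sketch, assuming a local Spark shell of the era under discussion (Spark 2.x); the data, column name `a`, and the rename-based workaround are illustrative and not taken from the thread.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-ambiguity")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3).toDF("a")
// df2 is derived from df1, so it carries the very same attribute (e.g. a#3).
val df2 = df1.filter($"a" > 1)

// Equality condition: the existing fix deduplicates the condition,
// so this join resolves (sides are interchangeable for "===").
df1.join(df2, df1("a") === df2("a")).show()

// Non-equality condition: df1("a") and df2("a") resolve to the same
// attribute, the sides cannot be told apart, and — as described in the
// thread — the query cannot be resolved correctly.
// df1.join(df2, df1("a") > df2("a")).show()

// A common workaround: break the shared lineage by renaming one side,
// which gives its column a fresh attribute id before the join.
val df2r = df2.toDF("a2")
df1.join(df2r, df1("a") > df2r("a2")).show()
```

Under Marco's proposal, the commented-out join would become resolvable: each attribute's metadata would carry the id of the Dataset it was requested from, letting the analyzer bind `df1("a")` to the left side and `df2("a")` to the right.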