Re: Self join

2019-01-30 Thread Marco Gaido
Hi all,

This thread seems to have stalled. If there are no objections, I'll go
ahead with a design doc describing the solution/workaround I mentioned
before. Any concerns?
Thanks,
Marco



Re: Self join

2018-12-13 Thread Ryan Blue
Thanks for the extra context, Marco. I thought you were trying to propose a
solution.

-- 
Ryan Blue
Software Engineer
Netflix


Re: Self join

2018-12-13 Thread Marco Gaido
Hi Ryan,

My goal with this email thread is to discuss with the community whether there
are better ideas (I was told that many other people have tried to address
this). I'd consider this a brainstorming thread. Once we have a good
proposal, we can go ahead with a SPIP.

Thanks,
Marco



Re: Self join

2018-12-12 Thread Ryan Blue
Marco,

I'm actually asking for a design doc that clearly states the problem and
proposes a solution. This is a substantial change and probably should be an
SPIP.

I think that would be more likely to generate discussion than referring to
PRs or a quick paragraph on the dev list, because the only people that are
looking at it now are the ones already familiar with the problem.

rb

-- 
Ryan Blue
Software Engineer
Netflix


Re: Self join

2018-12-12 Thread Marco Gaido
Thank you all for your answers.

@Ryan Blue sure, let me state the problem more clearly. Imagine you have two
DataFrames with a common lineage (for instance, one is derived from the other
by a filter or any other transformation), and you want to join them.
Currently, there is a fix by Reynold which deduplicates the join condition
when the condition is an equality (note that in this case it doesn't matter
which attribute is on the left and which is on the right). But if the
condition involves other comparisons, such as ">" or "<", the result is an
analysis error, because the attributes on both sides are the same (e.g. you
have the same id#3 attribute on both sides), and you cannot deduplicate them
blindly, since it matters which attribute is on which side.
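
To make the shared-attribute problem concrete, here is a toy model in plain
Scala. It is only a sketch: `Attr`, `Frame` and `filtered` are hypothetical
names invented for illustration, not Spark's internal classes, and the model
only shows why the two references cannot be told apart, not how Spark's
analyzer actually behaves.

```scala
// Toy model only: these types are hypothetical and are not Spark's classes.
final case class Attr(name: String, exprId: Long)

// A "frame" here is just the list of attributes it outputs.
final case class Frame(output: Seq[Attr]) {
  // Look up an output attribute by name, like df("a").
  def apply(name: String): Attr = output.find(_.name == name).get
  // A filter keeps the same output attributes, so the ids are shared.
  def filtered: Frame = copy()
}

object SharedLineageDemo {
  def main(args: Array[String]): Unit = {
    val df1 = Frame(Seq(Attr("a", 3L)))
    val df2 = df1.filtered

    val left = df1("a")
    val right = df2("a")

    // Both operands of `left > right` are the very same attribute a#3,
    // so the condition alone does not say which side each one refers to.
    println(s"left = ${left.name}#${left.exprId}, right = ${right.name}#${right.exprId}")
    println(s"indistinguishable: ${left == right}")
  }
}
```

For an equality the two operands can be swapped freely, which is why blind
deduplication is safe there; for ">" or "<" swapping them changes the result.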

@Reynold Xin my proposal was to add a dataset id to the metadata of each
attribute, so that in this case we can tell which DataFrame each attribute
comes from. I.e., given the DataFrames `df1` and `df2`, where `df2` is
derived from `df1`, `df1.join(df2, df1("a") > df2("a"))` could be resolved
because we would know that the first attribute is taken from `df1`, so it has
to be resolved against `df1`, and likewise the second against `df2`. But I am
open to any approach to this problem, if other people have better
ideas/suggestions.
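
The dataset-id idea can be sketched with a similar toy model in plain Scala.
Everything here is an assumption made for illustration: `Attr`, `Frame`,
`derive`, `datasetId`, `Side` and `resolve` are hypothetical names, and this is
neither Spark's API nor the code in the PR. It only shows how a tag carried
by each column reference would let a resolver pick the right join side.

```scala
// Toy model only: hypothetical types, not Spark's classes and not the PR code.
final case class Attr(name: String, exprId: Long, datasetId: Option[Long] = None)

final case class Frame(id: Long, output: Seq[Attr]) {
  // A column reference remembers which frame it was taken from.
  def apply(name: String): Attr =
    output.find(_.name == name).get.copy(datasetId = Some(id))
  // A derived frame gets a new id but keeps the same underlying attributes.
  def derive(newId: Long): Frame = copy(id = newId)
}

sealed trait Side
case object LeftSide extends Side
case object RightSide extends Side

object DatasetIdDemo {
  // Decide which join input a reference belongs to, using only the tag.
  def resolve(ref: Attr, left: Frame, right: Frame): Option[Side] =
    ref.datasetId.flatMap { id =>
      if (id == left.id) Some(LeftSide)
      else if (id == right.id) Some(RightSide)
      else None
    }

  def main(args: Array[String]): Unit = {
    val df1 = Frame(1L, Seq(Attr("a", 3L)))
    val df2 = df1.derive(2L)
    // Same name and expression id, but different dataset ids.
    println(resolve(df1("a"), df1, df2))
    println(resolve(df2("a"), df1, df2))
  }
}
```

In this model a frame joined with itself would still give both references the
same tag, so that case would need separate handling.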

Thanks,
Marco



Re: Self join

2018-12-11 Thread Jörn Franke
I don't know your exact underlying business problem, but maybe a graph
solution, such as Spark GraphX, would better meet your requirements.
Self-joins are usually done to address some kind of graph problem (even if
you would not describe it as such), and a graph solution is much more
efficient for these kinds of problems.



Re: Self join

2018-12-11 Thread Ryan Blue
Marco,

Thanks for starting the discussion! I think it would be great to have a
clear description of the problem and a proposed solution. Do you have
anything like that? It would help bring the rest of us up to speed without
reading different pull requests.

Thanks!

rb

On Tue, Dec 11, 2018 at 3:54 AM Marco Gaido wrote:

> Hi all,
>
> I'd like to bring to the attention of more people a problem which has
> been around for a long time, i.e., self joins. Currently, we have many
> problems with them. This has been reported to the community several times
> and seems to affect many people, but so far no solution has been accepted.
>
> I created a PR some time ago in order to address the problem (
> https://github.com/apache/spark/pull/21449), but Wenchen mentioned that he
> tried to fix this problem too, and that so far no attempt has been
> successful because there is no clear semantics (
> https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>
> So I'd like to discuss here what the best approach for tackling this issue
> is. I think it would be great to fix it for 3.0.0, so that if we decide to
> introduce breaking changes in the design, we can do that.
>
> Thoughts on this?
>
> Thanks,
> Marco
>


-- 
Ryan Blue
Software Engineer
Netflix