Hi all,

This thread seems to have stalled. So, if there are no objections, I'll go ahead with a design doc describing the solution/workaround I mentioned earlier. Any concerns?

Thanks,
Marco
On Thu, Dec 13, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:

> Thanks for the extra context, Marco. I thought you were trying to propose
> a solution.
>
> On Thu, Dec 13, 2018 at 2:45 AM Marco Gaido <marcogaid...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> My goal with this email thread is to discuss with the community whether
>> there are better ideas (as I was told many other people have tried to
>> address this). I'd consider this a brainstorming thread. Once we have a
>> good proposal, we can go ahead with an SPIP.
>>
>> Thanks,
>> Marco
>>
>> On Wed, Dec 12, 2018 at 7:13 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Marco,
>>>
>>> I'm actually asking for a design doc that clearly states the problem
>>> and proposes a solution. This is a substantial change and should
>>> probably be an SPIP.
>>>
>>> I think that would be more likely to generate discussion than referring
>>> to PRs or a quick paragraph on the dev list, because the only people
>>> looking at it now are the ones already familiar with the problem.
>>>
>>> rb
>>>
>>> On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido <marcogaid...@gmail.com> wrote:
>>>
>>>> Thank you all for your answers.
>>>>
>>>> @Ryan Blue <rb...@netflix.com> sure, let me state the problem more
>>>> clearly: imagine you have two DataFrames with a common lineage (for
>>>> instance, one is derived from the other by some filtering, or anything
>>>> you prefer), and you want to join these two DataFrames. Currently,
>>>> there is a fix by Reynold which deduplicates the join condition when
>>>> the condition is an equality (note that in this case it doesn't matter
>>>> which attribute is on the left and which is on the right). But if the
>>>> condition involves other comparisons, such as ">" or "<", the query
>>>> fails with an analysis error, because the attributes on both sides are
>>>> the same (e.g. you have the same id#3 attribute on both sides), and
>>>> you cannot deduplicate them blindly, since which side each one belongs
>>>> to matters.
>>>>
>>>> @Reynold Xin <r...@databricks.com> my proposal was to add a dataset id
>>>> to the metadata of each attribute, so that in this case we can
>>>> distinguish which DataFrame an attribute comes from. I.e., given
>>>> DataFrames `df1` and `df2` where `df2` is derived from `df1`,
>>>> `df1.join(df2, df1("a") > df2("a"))` could be resolved because we
>>>> would know that the first attribute has to be resolved against `df1`
>>>> and the second against `df2`. But I am open to any approach to this
>>>> problem, if other people have better ideas or suggestions.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Tue, Dec 11, 2018 at 6:31 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> I don't know your exact underlying business problem, but maybe a
>>>>> graph solution such as Spark GraphX better meets your requirements.
>>>>> Self-joins are usually done to address some kind of graph problem
>>>>> (even if you would not describe it as such), and a graph engine is
>>>>> much more efficient for these kinds of problems.
>>>>>
>>>>> On Dec 11, 2018, at 12:44 PM, Marco Gaido <marcogaid...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to bring to the attention of more people a problem which has
>>>>> been there for a long time, i.e., self-joins. Currently, we have many
>>>>> troubles with them. This has been reported to the community several
>>>>> times and seems to affect many people, but as of now no solution has
>>>>> been accepted for it.
>>>>>
>>>>> I created a PR some time ago to address the problem
>>>>> (https://github.com/apache/spark/pull/21449), but Wenchen mentioned
>>>>> that he had tried to fix this problem too, and so far no attempt has
>>>>> been successful because there is no clear semantic
>>>>> (https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>>>>>
>>>>> So I'd like to propose that we discuss here which is the best
>>>>> approach for tackling this issue. I think it would be great to fix it
>>>>> for 3.0.0, so that if we decide to introduce breaking changes in the
>>>>> design, we can do that.
>>>>>
>>>>> Thoughts on this?
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
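[Editor's note: the self-join ambiguity discussed in this thread can be reproduced with a short snippet. This is a minimal sketch, assuming a local Spark shell of the era under discussion (Spark 2.x); the data, column name `a`, and the rename-based workaround are illustrative and not taken from the thread.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-ambiguity")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3).toDF("a")
// df2 is derived from df1, so it carries the very same attribute (e.g. a#3).
val df2 = df1.filter($"a" > 1)

// Equality condition: the existing fix deduplicates the condition,
// so this join resolves (sides are interchangeable for "===").
df1.join(df2, df1("a") === df2("a")).show()

// Non-equality condition: df1("a") and df2("a") resolve to the same
// attribute, the sides cannot be told apart, and — as described in the
// thread — the query cannot be resolved correctly.
// df1.join(df2, df1("a") > df2("a")).show()

// A common workaround: break the shared lineage by renaming one side,
// which gives its column a fresh attribute id before the join.
val df2r = df2.toDF("a2")
df1.join(df2r, df1("a") > df2r("a2")).show()
```

Under Marco's proposal, the commented-out join would become resolvable: each attribute's metadata would carry the id of the Dataset it was requested from, letting the analyzer bind `df1("a")` to the left side and `df2("a")` to the right.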