[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985860#comment-14985860
 ] 

Narine Kokhlikyan commented on SPARK-11250:
-------------------------------------------

Hi [~davies], [~rxin], [~shivaram]

I have some questions regarding the joins:

1. To create aliases we would need suffixes. In R, suffixes is an input 
argument of merge. We could of course provide default values, but what do you 
think about exposing it as an input argument, similar to R?

2. Let's say that we have the following two dataframes:
scala> df
res49: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

scala> df2
res50: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

If I do a join like df.join(df2) or 
df.join(df2, df("rating") === df2("rating")),
the resulting dataframe has the following structure:
res58: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int, 
rating: int, income: double, age: int]

Instead, we could produce something like this:
org.apache.spark.sql.DataFrame = [rating_x: int, income_x: double, age_x: int, 
rating_y: int, income_y: double, age_y: int]

or just show what R does:
org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

3. Also, R adds suffixes only to the columns that are not in the join 
expression. For example, after df <- merge(iris, iris, by=c("Species")), 
df has the following structure:

colnames(df)
[1] "Species"        "Sepal.Length.x" "Sepal.Width.x"  "Petal.Length.x" "Petal.Width.x"  "Sepal.Length.y" "Sepal.Width.y" 
[8] "Petal.Length.y" "Petal.Width.y" 
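
To make the R behaviour concrete, here is a minimal sketch in plain Scala (no 
Spark dependency) of how the output column names could be derived. The helper 
name mergedColumnNames, its parameters, and the default suffixes "_x" / "_y" 
are assumptions for illustration only; none of this is an existing Spark API.

```scala
// Hypothetical helper: computes R-merge-style output column names.
// It works on plain name lists and does not touch any DataFrame.
object MergeNames {
  def mergedColumnNames(
      left: Seq[String],
      right: Seq[String],
      by: Seq[String],
      suffixes: (String, String) = ("_x", "_y")): Seq[String] = {
    val keys = by.toSet
    // Join keys appear once in the output, so drop them from both sides.
    val leftRest = left.filterNot(c => keys.contains(c))
    val rightRest = right.filterNot(c => keys.contains(c))
    // Only non-key names present on both sides need a suffix.
    val clashing: Set[String] =
      leftRest.filter(c => rightRest.contains(c)).toSet
    val leftOut =
      leftRest.map(c => if (clashing.contains(c)) c + suffixes._1 else c)
    val rightOut =
      rightRest.map(c => if (clashing.contains(c)) c + suffixes._2 else c)
    // Same ordering as R: keys first, then left columns, then right columns.
    by ++ leftOut ++ rightOut
  }
}
```

With the columns from point 2 and by = Seq("rating"), this sketch yields 
rating, income_x, age_x, income_y, age_y. With an empty by, every column is 
suffixed, which is the first variant shown in point 2.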

Do you have any preferences?

Thanks,
Narine

> Generate different alias for columns with same name during join
> ---------------------------------------------------------------
>
>                 Key: SPARK-11250
>                 URL: https://issues.apache.org/jira/browse/SPARK-11250
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Narine Kokhlikyan
>
> It's confusing to see columns with same name after joining, and hard to 
> access them, we could generate different alias for them in joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



