[ https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808530#comment-16808530 ]
Martin Studer commented on SPARK-19860:
---------------------------------------

We're observing the same issue with PySpark 2.3.0. It happens on an inner join of two DataFrames that have a single column in common (the join column). If I rename one of the columns, as mentioned by [~wuchang1989], and then use a join expression, the join succeeds. Isolating the problem is difficult because it occurs only in the context of a larger pipeline.

> DataFrame join gets a conflict error if the two frames have a column with the same name.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19860
>                 URL: https://issues.apache.org/jira/browse/SPARK-19860
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: wuchang
>            Priority: Major
>
> {code}
> >>> print df1.collect()
> [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)]
> >>> print df2.collect()
> [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)]
> >>> ht_net_in_df = df1.join(df2, df1.fdate == df2.fdate, 'inner')
> 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
>     jdf = self._jdf.join(other._jdf, on._jc, how)
>   File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
>   File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"
> Failure when resolving conflicting references in Join:
> 'Join Inner, (fdate#42 = fdate#42)
> :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97]
> :  +- Filter (inorout#44 = A)
> :     +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
> :        +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> :           +- SubqueryAlias history_transfer_v
> :              +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51]
> :                 +- SubqueryAlias history_transfer
> :                    +- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet
> +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount2#145]
>    +- Filter (inorout#44 = B)
>       +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
>          +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
>             +- SubqueryAlias history_transfer_v
>                +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51]
>                   +- SubqueryAlias history_transfer
>                      +- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet
>
> Conflicting attributes: fdate#42
> {code}
> Only when I use the .withColumnRenamed('fdate', 'fdate2') method to rename df1's fdate column to fdate1 and df2's to fdate2 does the join succeed.
> So my question is: why does the conflict happen?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)