wuchang created SPARK-19860:
-------------------------------

             Summary: DataFrame join fails with a conflicting-references error if the two frames share a same-named column
                 Key: SPARK-19860
                 URL: https://issues.apache.org/jira/browse/SPARK-19860
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0
            Reporter: wuchang
>>> print df1.collect()
[Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)]
>>> print df2.collect()
[Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)]
>>> ht_net_in_df = df1.join(df2, df1.fdate == df2.fdate, 'inner')
2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
    jdf = self._jdf.join(other._jdf, on._jc, how)
  File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Failure when resolving conflicting references in Join:
'Join Inner, (fdate#42 = fdate#42)
:- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97]
:  +- Filter (inorout#44 = A)
:     +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
:        +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
:           +- SubqueryAlias history_transfer_v
:              +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51]
:                 +- SubqueryAlias history_transfer
:                    +- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet
+- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount2#145]
   +- Filter (inorout#44 = B)
      +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
         +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
            +- SubqueryAlias history_transfer_v
               +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51]
                  +- SubqueryAlias history_transfer
                     +- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet

Conflicting attributes: fdate#42

The join only succeeds when I use the .withColumnRenamed() method to rename df1's fdate column to fdate1 and df2's fdate column to fdate2. So my question is: why does this conflict happen?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org