You are correct. I just found the same thing. You are better off with SQL, then.
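
As an aside, a possible workaround on the DataFrame side is to give each
derived frame its own alias and resolve the join columns through those
aliases. This is only an untested sketch, assuming the DataFrame.alias and
pyspark.sql.functions.col API available since 1.3; the names d1/d2 are just
placeholders:

from pyspark.sql.functions import col

# Alias each derived frame so the join condition cannot resolve both
# sides back to the same column of the shared source DataFrame.
d1 = df1.alias("d1")
d2 = df2.alias("d2")
dfj_aliased = d1.join(d2, col("d1.userId") == col("d2.userId"), "inner")
print dfj_aliased.count()

Here is the comparison I ran, SQL API vs DataFrame API: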
userSchemaDF = ssc.createDataFrame(userRDD)
userSchemaDF.registerTempTable("users")
#print userSchemaDF.take(10)

#SQL API works as expected
sortedDF = ssc.sql("SELECT userId, age, gender, work FROM users ORDER BY work, age")
#print sortedDF.take(10)

df1 = userSchemaDF.select("userId", "age")
df1.registerTempTable("df1")
df2 = userSchemaDF.select("userId", "work")
df2.registerTempTable("df2")

dfjs = ssc.sql("SELECT age, work FROM df1 d1 INNER JOIN df2 d2 ON d1.userId = d2.userId")
print dfjs.count()

#DF API does not
dfj = df1.join(df2, df1.userId == df2.userId, "inner")
print dfj.count()

Output:

943
889249

The SQL join returns 943 rows, but the DataFrame join returns 889249
(= 943 * 943), i.e. a full cartesian product.

On Wed, Apr 22, 2015 at 1:10 AM, Karlson <ksonsp...@siberie.de> wrote:

> Sorry, my code actually was
>
> df_one = df.select('col1', 'col2')
> df_two = df.select('col1', 'col3')
>
> But in Spark 1.4.0 this does not seem to make any difference anyway, and
> the problem is the same with both versions.
>
> On 2015-04-21 17:04, ayan guha wrote:
>
>> Your code should be
>>
>> df_one = df.select('col1', 'col2')
>> df_two = df.select('col1', 'col3')
>>
>> Your current code is generating a tuple, and of course df_one and df_two
>> are then different, so the join yields a cartesian product.
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 22, 2015 at 12:42 AM, Karlson <ksonsp...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> Can anyone confirm (and if so, elaborate on) the following problem?
>>>
>>> When I join two DataFrames that originate from the same source DataFrame,
>>> the resulting DF explodes to a huge number of rows. A quick example:
>>>
>>> I load a DataFrame with n rows from disk:
>>>
>>> df = sql_context.parquetFile('data.parquet')
>>>
>>> Then I create two DataFrames from that source:
>>>
>>> df_one = df.select(['col1', 'col2'])
>>> df_two = df.select(['col1', 'col3'])
>>>
>>> Finally I want to (inner) join them back together:
>>>
>>> df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'],
>>> 'inner')
>>>
>>> The key in col1 is unique. The resulting DataFrame should have n rows,
>>> but it has n*n rows instead.
>>>
>>> That does not happen when I load df_one and df_two from disk directly. I
>>> am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot.

--
Best Regards,
Ayan Guha