Re: Join on DataFrames from the same source (Pyspark)

2015-04-24 Thread Michael Armbrust
Fixed in master: https://github.com/apache/spark/commit/2d010f7afe6ac8e67e07da6bea700e9e8c9e6cc2

On Wed, Apr 22, 2015 at 12:19 AM, Karlson wrote:
> DataFrames do not have the attributes 'alias' or 'as' in the Python API.
>
> On 2015-04-21 20:41, Michael Armbrust wrote:
>
>> This is https://iss

Re: Join on DataFrames from the same source (Pyspark)

2015-04-22 Thread Karlson
DataFrames do not have the attributes 'alias' or 'as' in the Python API.

On 2015-04-21 20:41, Michael Armbrust wrote:
> This is https://issues.apache.org/jira/browse/SPARK-6231
>
> Unfortunately this is pretty hard to fix, as it's hard for us to differentiate these without aliases. However you can add

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Michael Armbrust
This is https://issues.apache.org/jira/browse/SPARK-6231

Unfortunately this is pretty hard to fix, as it's hard for us to differentiate these without aliases. However, you can add an alias as follows:

    from pyspark.sql.functions import *
    df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
You are correct. Just found the same thing. You are better off with SQL, then.

    userSchemaDF = ssc.createDataFrame(userRDD)
    userSchemaDF.registerTempTable("users")
    #print userSchemaDF.take(10)
    #SQL API works as expected
    sortedDF = ssc.sql("SELECT userId,age,gender,work from users ord

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
Sorry, my code actually was

    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')

But in Spark 1.4.0 this does not seem to make any difference anyway, and the problem is the same with both versions.

On 2015-04-21 17:04, ayan guha wrote:
> your code should be df_one =

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
Your code should be

    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')

Your current code is generating a tuple, and of course df_1 and df_2 are different, so the join is yielding a Cartesian product.

Best
Ayan

On Wed, Apr 22, 2015 at 12:42 AM, Karlson wrote:
> Hi,
>
> can anyone co

Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
Hi,

can anyone confirm (and if so elaborate on) the following problem? When I join two DataFrames that originate from the same source DataFrame, the resulting DF will explode to a huge number of rows. A quick example:

I load a DataFrame with n rows from disk:

    df = sql_context.parquetFil
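To make the reported row explosion concrete, here is a toy model in plain Python, not Spark. It assumes the failure mode discussed in this thread: when both sides of the join condition end up referring to the same column, the condition is always true and every row pairs with every other row. The function `nested_loop_join` and the sample rows are hypothetical and exist only for illustration.

```python
def nested_loop_join(left, right, condition):
    """Naive join: keep every (left_row, right_row) pair satisfying condition."""
    return [(l, r) for l in left for r in right if condition(l, r)]


# A made-up source "DataFrame" with n = 4 rows and a unique key col1.
rows = [{"col1": i, "col2": i * 10, "col3": i * 100} for i in range(4)]

# Two projections of the same source, as in the question.
df_one = [{"col1": r["col1"], "col2": r["col2"]} for r in rows]
df_two = [{"col1": r["col1"], "col3": r["col3"]} for r in rows]

# Intended condition: the left key equals the right key.
intended = nested_loop_join(df_one, df_two, lambda l, r: l["col1"] == r["col1"])

# Degenerate condition: both operands refer to the same side, so the
# comparison is trivially true and the join becomes a Cartesian product.
degenerate = nested_loop_join(df_one, df_two, lambda l, r: l["col1"] == l["col1"])

print(len(intended))    # 4  (n rows)
print(len(degenerate))  # 16 (n * n rows)
```

The second result grows quadratically with n, which matches the "huge number of rows" symptom for any DataFrame of realistic size.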