fixed in master:
https://github.com/apache/spark/commit/2d010f7afe6ac8e67e07da6bea700e9e8c9e6cc2
On Wed, Apr 22, 2015 at 12:19 AM, Karlson wrote:
> DataFrames do not have the attributes 'alias' or 'as' in the Python API.
>
>
> On 2015-04-21 20:41, Michael Armbrust wrote:
>
>> This is https://issues.apache.org/jira/browse/SPARK-6231
This is https://issues.apache.org/jira/browse/SPARK-6231
Unfortunately this is pretty hard to fix, as it's hard for us to
differentiate these without aliases. However, you can add an alias as
follows:
from pyspark.sql.functions import *
df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
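To make the effect of the aliases concrete, here is a minimal plain-Python sketch of the same idea (no Spark required; the `alias`/`join` helpers and the sample rows are illustrative, not Spark APIs): each alias gets its own column namespace, so "a.col1" and "b.col1" stay distinct even though both sides come from the same source.

```python
# Plain-Python sketch of the aliased self-join above (illustrative helpers,
# not Spark APIs): prefixing columns with the alias name keeps the two
# sides of a self-join distinguishable.

rows = [
    {"col1": 1, "col2": "x"},
    {"col1": 2, "col2": "y"},
    {"col1": 2, "col2": "z"},
]

def alias(rows, name):
    """Prefix every column with the alias, like df.alias(name)."""
    return [{f"{name}.{k}": v for k, v in row.items()} for row in rows]

def join(left, right, on_left, on_right):
    """Inner join on equality between two named columns."""
    return [{**l, **r} for l in left for r in right if l[on_left] == r[on_right]]

a = alias(rows, "a")
b = alias(rows, "b")
joined = join(a, b, "a.col1", "b.col1")

# key 1 matches 1*1 times, key 2 matches 2*2 times: 5 rows, not 3*3 = 9
print(len(joined))  # 5
```

With the aliases in place the join condition compares two genuinely different columns, so you get a keyed join rather than a degenerate one.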
You are correct, just found the same thing. You are better off with SQL,
then.
userSchemaDF = ssc.createDataFrame(userRDD)
userSchemaDF.registerTempTable("users")
#print userSchemaDF.take(10)
#SQL API works as expected
sortedDF = ssc.sql("SELECT userId,age,gender,work from users ord
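For illustration, the SQL route can be sketched with sqlite3 from the Python standard library standing in for Spark SQL (the `users` table, its columns, and the query here are made up for the example): register the data as a table, then let SQL table aliases disambiguate a self-join, just as registerTempTable plus ssc.sql would.

```python
import sqlite3

# Sketch only: sqlite3 stands in for Spark SQL. The point is that SQL
# table aliases (u1, u2) make a self-join unambiguous.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userId INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, 30), (2, 25), (3, 25)])

# Self-join via aliases: pairs of distinct users with the same age.
pairs = conn.execute("""
    SELECT u1.userId, u2.userId
    FROM users u1 JOIN users u2
      ON u1.age = u2.age AND u1.userId < u2.userId
""").fetchall()
print(pairs)  # [(2, 3)]
```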
Sorry, my code actually was
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
But in Spark 1.4.0 this does not seem to make any difference anyway, and
the problem is the same with both versions.
On 2015-04-21 17:04, ayan guha wrote:
your code should be
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Your current code is generating a tuple, and of course df_one and df_two are
different, so the join is yielding a Cartesian product.
Best
Ayan
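The row-count arithmetic behind that Cartesian blow-up can be checked in plain Python (a sketch, not Spark): when a join condition degenerates to "always true", for instance because both sides resolve to the very same column, an inner join produces n * n output rows.

```python
# Sketch (plain Python, no Spark): a proper equi-join vs. a join whose
# condition is trivially true. The latter is a Cartesian product.

n = 100
left = list(range(n))
right = list(range(n))

proper_join = [(l, r) for l in left for r in right if l == r]
cartesian   = [(l, r) for l in left for r in right if True]  # degenerate condition

print(len(proper_join))  # 100
print(len(cartesian))    # 10000
```

This matches the "explodes to a huge number of rows" symptom: n input rows becoming roughly n squared output rows is the signature of a join that has silently turned Cartesian.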
On Wed, Apr 22, 2015 at 12:42 AM, Karlson wrote:
Hi,
can anyone confirm (and, if so, elaborate on) the following problem?
When I join two DataFrames that originate from the same source
DataFrame, the resulting DF will explode to a huge number of rows. A
quick example:
I load a DataFrame with n rows from disk:
df = sql_context.parquetFil