Hi all, I get the error below when I use the 'join' function to left outer join two DataFrames in PySpark 1.4.1. Could you please point out where I went wrong? Thank you.

Traceback (most recent call last):
  File "/Users/wz/PycharmProjects/PysparkTraining/Airbnb/src/driver.py", line 46, in <module>
    trainSessionDF = trainDF.join(sessionDF, trainDF.id == sessionDF.user_id, 'left_outer')
  File "/Users/wz/Downloads/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 701, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'id'

- *It does have a column called "id"*

15/12/19 14:15:00 INFO SparkContext: Invoking stop() from shutdown hook

Here is the code:

    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.sql.functions import *

    conf = SparkConf().set("spark.executor.memory", "4g")
    sc = SparkContext(conf=conf)
    sqlCtx = SQLContext(sc)

    train = sc.textFile("../train_users_2.csv").map(lambda line: line.split(","))
    print train.first()
    trainDF = sqlCtx.createDataFrame(train)

    test = sc.textFile("../test_users.csv").map(lambda line: line.split(","))
    testDF = sqlCtx.createDataFrame(test)

    session = sc.textFile("../sessions.csv").map(lambda line: line.split(","))
    sessionDF = sqlCtx.createDataFrame(session)

    # join train with session (Error)
    trainSessionDF = trainDF.join(sessionDF, trainDF.id == sessionDF.user_id, 'left_outer')

Best Regards,
WZ