Hello, I have 2 text file in the following form and my goal is to calculate the Pearson correlation between them using sliding window in pyspark:
123.00 -12.00 334.00 . . . First I read these 2 text file and store them in RDD format and then I apply the window operation on each RDD but I keep getting this error: * AttributeError: 'PipelinedRDD' object has no attribute window* Here is my code: if __name__ == "__main__": spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate() # DEFINE your input path input_path1 = sys.argv[1] input_path2 = sys.argv[2] num_of_partitions = 4 rdd1 = spark.sparkContext.textFile(input_path1, num_of_partitions).flatMap(lambda line1: line1.split("\n").strip()).map(lambda strelem1: float(strelem1)) rdd2 = spark.sparkContext.textFile(input_path2, num_of_partitions).flatMap(lambda line2: line2.split("\n").strip()).map(lambda strelem2: float(strelem2)) #Windowing windowedrdd1= rdd1.window(3,2) windowedrdd2= rdd2.window(3,2) #Correlation between sliding windows CrossCorr = Statistics.corr(windowedrdd1, windowedrdd2, method="pearson") if CrossCorr >= 0.7: print("rdd1 & rdd2 are correlated") I know from the error that window operation is only for DStream but since I have RDD here how I can do window operation on RDDs? Thank you, Zeinab -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org