Dear all,

We're running some tests with cache() and persist() on Datasets. With RDDs, we know that transformations are lazy and are executed only when an action occurs. So, for example, we call .cache() on an RDD at the end of a sequence of transformations, and the cache is populated only when an action is then executed.
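To make the RDD behaviour we have in mind concrete, here is a toy sketch in plain Python. It does not use Spark at all: `LazyDataset` and its `map`, `cache`, `count` and `collect` methods are hypothetical names that only mimic the idea of deferred transformations and a cache filled by the first action, and the real engine may well work differently.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, rows=None, parent=None, op=None):
        self._rows = rows          # source data, only set on the root dataset
        self._parent = parent      # dataset this one was derived from
        self._op = op              # function applied to the parent's rows
        self._use_cache = False
        self._cached = None        # filled by the first action if cache() was called

    def map(self, fn):
        # Transformation: only records the step, no data is touched here.
        return LazyDataset(parent=self, op=lambda rows: [fn(r) for r in rows])

    def cache(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        if self._parent is None:
            rows = list(self._rows)
        else:
            rows = self._op(self._parent._compute())
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        # Action: triggers the whole chain of recorded transformations.
        return len(self._compute())

    def collect(self):
        # Action: returns a copy of the computed rows.
        return list(self._compute())


calls = []  # records every invocation of tokenize, to show when work happens


def tokenize(text):
    calls.append(text)
    return text.split()


reviews = LazyDataset(rows=["good film", "bad plot"])
words = reviews.map(tokenize).cache()
calls_before_action = len(calls)   # no action yet, so tokenize has not run

n = words.count()                  # first action: runs tokenize and fills the cache
calls_after_first = len(calls)

rows = words.collect()             # second action: served from the cache
calls_after_second = len(calls)
```

In this toy model, `tokenize` runs only when `count()` is called, and the later `collect()` reuses the cached rows instead of recomputing them.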
However, which operations are lazy in Datasets and DataFrames? For example, take the following code (fragment):

    from pyspark.ml import PipelineModel
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, RegexTokenizer, StopWordsRemover

    (df_train, df_test) = df.randomSplit([0.8, 0.2])

    # Tokenize the reviews, matching runs of letters
    r_tokenizer = RegexTokenizer(inputCol="review", outputCol="words_all",
                                 gaps=False, pattern="\\p{L}+")
    df_words_all = r_tokenizer.transform(df_train)

    # Remove stop words and drop the intermediate column
    remover = StopWordsRemover(inputCol="words_all", outputCol="words_filtered")
    df_filtered = remover.transform(df_words_all)
    df_filtered = df_filtered.drop('words_all')

    # Hash the remaining words into feature vectors
    hashingTF = HashingTF(inputCol="words_filtered", outputCol="features")
    df_features = hashingTF.transform(df_filtered)
    df_features = df_features.drop('words_filtered')

    # Train the model (iteractions is defined elsewhere in our script)
    lr = LogisticRegression(maxIter=iteractions, regParam=0.01)
    model1 = lr.fit(df_features)

    # Apply the same feature stages to the test set and evaluate
    evaluator = BinaryClassificationEvaluator()
    pipelineModel_features = PipelineModel(stages=[r_tokenizer, remover, hashingTF])
    df_test_features = pipelineModel_features.transform(df_test)
    predictions = model1.transform(df_test_features)
    eval_test = evaluator.evaluate(predictions)

Will all the transformations of df_train and df_test occur only when fit() and evaluate() are executed?