Spark SQL - Actions and Transformations

brccosta Tue, 13 Sep 2016 05:28:13 -0700

Dear all,

We're running some tests with cache and persist on Datasets. With RDDs, we
know that transformations are lazy and are executed only when an action
occurs. So, for example, we put a .cache() on an RDD after an action, which
in turn is executed as the last operation of a sequence of transformations.


However, which operations are lazy in Datasets and DataFrames? Take, for
example, the following code (fragment):

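from pyspark.ml import PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, RegexTokenizer, StopWordsRemover

# df and iteractions are defined earlier (not shown in this fragment)
(df_train, df_test) = df.randomSplit([0.8, 0.2])

# Tokenize: with gaps=False the pattern matches the tokens themselves
r_tokenizer = RegexTokenizer(inputCol="review", outputCol="words_all",
                             gaps=False, pattern="\\p{L}+")
df_words_all = r_tokenizer.transform(df_train)

# Remove stop words, then drop the intermediate column
remover = StopWordsRemover(inputCol="words_all", outputCol="words_filtered")
df_filtered = remover.transform(df_words_all)
df_filtered = df_filtered.drop('words_all')

# Hash the remaining words into feature vectors
hashingTF = HashingTF(inputCol="words_filtered", outputCol="features")
df_features = hashingTF.transform(df_filtered)
df_features = df_features.drop('words_filtered')

# Train on the training features
lr = LogisticRegression(maxIter=iteractions, regParam=0.01)
model1 = lr.fit(df_features)

# Apply the same feature stages to the test split and evaluate
evaluator = BinaryClassificationEvaluator()
pipelineModel_features = PipelineModel(stages=[r_tokenizer, remover,
                                               hashingTF])
df_test_features = pipelineModel_features.transform(df_test)
predictions = model1.transform(df_test_features)
eval_test = evaluator.evaluate(predictions)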

Will all the transformations of df_train and df_test occur only when
fit() and evaluate() are executed?



