Dear all,
is the following usage of the DataFrame constructor correct, or does it
trigger any side effects that I should be aware of?
My goal is to keep track of my dataframe's state and allow custom
transformations accordingly.
val df: DataFrame = ...some dataframe...
val newDf = new
The Spark documentation shows the following example code:
// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 }))
}
I'm sort of missing why x / 16
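One possible reading of the division by 16, sketched in plain Scala without Spark: if the feature values are assumed to lie in the range 0 to 255 (an assumption, not stated in the quoted snippet), then dividing by a bin width of 16 and rounding down yields 16 bins numbered 0 to 15. Note that the quoted snippet divides without rounding; the `math.floor` call and the names `BinningSketch` and `bin` below are additions for illustration only.

```scala
// Hypothetical illustration: assumes feature values lie in [0, 255].
object BinningSketch {
  // Map a value to one of 16 equal-width bins (0 to 15):
  // divide by the bin width (256 / 16 = 16) and round down.
  def bin(x: Double): Double = math.floor(x / 16)

  def main(args: Array[String]): Unit = {
    val samples = Seq(0.0, 15.0, 16.0, 128.0, 255.0)
    // Print each sample value next to the bin it falls into.
    samples.foreach(x => println(s"$x -> bin ${bin(x)}"))
  }
}
```

Without the rounding step, `x / 16` only rescales the values, so each distinct input would still map to a distinct output rather than to one of 16 categories.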
I understand that RDDs are not created until an action is called. Is it a
correct conclusion that it doesn't matter if .cache is used anywhere in
the program if I only have one action that is called only once?
Related to this question, consider this situation:
val d1 = data.map { case (x, y, z) => (x, y) }