Hi, I thought I understood RDDs and DataFrames, but one noob thing is bugging me (because I'm seeing weird errors involving joins):
*What does Spark do when you pass a big DataFrame as an argument to a function?* Are these DataFrames included in the function's closure, so that each big argument DataFrame is shipped to every node instead of respecting the locality of its distributed data? Or are only the *references* to these distributed objects captured in the closure, not the objects themselves? I was assuming the latter, but I'm not sure anymore.

For example, I wrote a function myLeftJoin(df1, df2, Seq(columns)) that merges df1 and df2 on several columns present in both. I don't want an inner equijoin: I want a left join, and I don't want repeated columns. Inside the function I build an SQL statement and execute it. Is this construct inefficient? Could it exhaust memory somehow because df1 and df2 are passed into the function?

I've had cases where calling a function in spark-shell produced the error "unable to acquire xxxx bytes of memory", but pasting the function's body directly into spark-shell, without wrapping it in a function, did not produce the same error. So does calling a function in Spark carry some memory overhead?

Thanks for any clarifications!
Kristina
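P.S. Here's a minimal sketch of the SQL-building part of what I mean by myLeftJoin. The names (leftJoinSql, view names, rightOnlyCols) are just illustrative; only the string construction is shown, since the surrounding code would register the DataFrames as temp views and run the result through spark.sql:

```scala
// Sketch of the SQL-building part of myLeftJoin (names illustrative).
// In the real function, df1/df2 would first be registered, e.g.
//   df1.createOrReplaceTempView("t1"); df2.createOrReplaceTempView("t2")
// and the returned string would be passed to spark.sql(...).
// To avoid repeated columns, the join columns are selected only from
// the left side; rightOnlyCols lists the right side's non-join columns.
def leftJoinSql(leftView: String, rightView: String,
                joinCols: Seq[String], rightOnlyCols: Seq[String]): String = {
  val on = joinCols.map(c => s"l.$c = r.$c").mkString(" AND ")
  val rightSel = rightOnlyCols.map(c => s"r.$c").mkString(", ")
  s"SELECT l.*, $rightSel FROM $leftView l LEFT JOIN $rightView r ON $on"
}
```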