Hi, DataFrame extends RDDApi, which provides RDD-like methods. My question is: is a DataFrame a sort of standalone RDD with its own partitions, or does it depend on the partitions of the underlying RDD that was used to load the data? It's written that a DataFrame has the ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster, but I don't understand whether the partitions of the DataFrame are independent of the partitions of the data source that was used to load the data.
So assume, theoretically, that I used the external Data Sources API and wrote code that loads 1 GB of data into a single partition. Then I map this data source to a DataFrame and run some SQL that returns all the records. Will this DataFrame also have one partition in memory, or will Spark somehow divide the DataFrame into multiple partitions? If so, how will it be divided into partitions? By size? (Can someone point me to the code to see an example?) Thanks, Gil.
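For concreteness, here is a minimal sketch of the scenario I have in mind. This assumes an already-created `sqlContext`; the format name `"my.custom.source"` and the input path are made up for illustration and stand in for my single-partition external data source:

```scala
// Hypothetical: "my.custom.source" is a custom external data source
// whose relation loads the whole 1 GB file into ONE RDD partition.
val df = sqlContext.read.format("my.custom.source").load("/data/1gb-input")

// Is this 1, inherited from the source RDD's partitioning,
// or does Spark repartition the DataFrame on its own?
println(df.rdd.partitions.length)

// Same question for the result of a SQL query over that DataFrame:
df.registerTempTable("t")
val all = sqlContext.sql("SELECT * FROM t")
println(all.rdd.partitions.length)
```

In other words, I would like to know what `df.rdd.partitions.length` reports in this case, and which code path decides it.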