I have two questions. First, I have a failure when writing Parquet from Spark 1.6.1 on Amazon EMR to S3. This is a full batch job over 200 GB of source data. The output is partitioned by a geographic identifier we use and by the date we received the data. However, because of geographical density, we could well be producing tiles that are too dense. I'm trying to figure out how to determine the size of the file it's trying to write out.
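
For context, the write looks roughly like this (simplified; the column names and S3 path are illustrative, not our real ones):

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Simplified version of the failing job (Spark 1.6.1 on EMR): write the
    // DataFrame as Parquet to S3, partitioned on disk by the geographic tile
    // id and the date we received the data.
    def writeTiles(df: DataFrame): Unit = {
      df.write
        .mode(SaveMode.Overwrite)
        .partitionBy("geo_tile", "ingest_date")
        .parquet("s3://our-bucket/tiles/")
    }
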
Second, we used to use RDDs and RangePartitioner for task partitioning. However, I don't see this available in DataFrames. How does one achieve this now?
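
This is roughly what we used to do on the RDD side (the key and value types are placeholders for our real ones):

    import org.apache.spark.RangePartitioner
    import org.apache.spark.rdd.RDD

    // Range-partition key/value pairs so each task works on a contiguous key
    // range; String keys and byte-array values stand in for our real types.
    def rangePartition(pairs: RDD[(String, Array[Byte])],
                       numPartitions: Int): RDD[(String, Array[Byte])] = {
      val partitioner = new RangePartitioner(numPartitions, pairs)
      pairs.partitionBy(partitioner)
    }

Peter Halliday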