I have two questions. First, I have a failure when writing Parquet from Spark 1.6.1 on Amazon EMR to S3. This is a full batch job over 200 GB of source data. The output is partitioned by a geographic identifier we use and by the date we received the data. However, because of geographical density, we could well be producing tiles that are too dense. I'm trying to figure out how to determine the size of the file it's trying to write out.
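
For context, the write looks roughly like this (simplified; the column names and S3 path are illustrative, not our real ones):

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Simplified version of the failing job (Spark 1.6.1 on EMR): write the
    // DataFrame as Parquet to S3, partitioned on disk by the geographic tile
    // id and the date we received the data.
    def writeTiles(df: DataFrame): Unit = {
      df.write
        .mode(SaveMode.Overwrite)
        .partitionBy("geo_tile", "ingest_date")
        .parquet("s3://our-bucket/tiles/")
    }
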
Second, we used to use RDDs and RangePartitioner for task partitioning. However, I don't see this available in DataFrames. How does one achieve this now?
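
This is roughly what we used to do on the RDD side (the key and value types are placeholders for our real ones):

    import org.apache.spark.RangePartitioner
    import org.apache.spark.rdd.RDD

    // Range-partition key/value pairs so each task works on a contiguous key
    // range; String keys and byte-array values stand in for our real types.
    def rangePartition(pairs: RDD[(String, Array[Byte])],
                       numPartitions: Int): RDD[(String, Array[Byte])] = {
      val partitioner = new RangePartitioner(numPartitions, pairs)
      pairs.partitionBy(partitioner)
    }

Peter Halliday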