All, I am having trouble getting a sequence file sorted. My sequence file is (Text, Text), and when I try to sort it, Spark complains that it cannot because Text is not serializable. To get around this, I map over the sequence file to turn it into (String, String). I then perform the sort and write the result back out to HDFS as a sequence file.
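For reference, a minimal sketch of the workaround described above, assuming the Scala API. The object name and both paths are placeholders chosen for illustration, not the actual job:

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}
// Brings pair-RDD operations into scope on older Spark releases;
// newer releases resolve them without this import.
import org.apache.spark.SparkContext._

object SortSequenceFile {
  def main(args: Array[String]): Unit = {
    // Hypothetical locations; substitute the real HDFS paths.
    val inputPath  = "hdfs:///path/to/input"
    val outputPath = "hdfs:///path/to/sorted"

    val sc = new SparkContext(new SparkConf().setAppName("SortSequenceFile"))

    // Read (Text, Text) records and copy each one into Strings,
    // since Text itself is not java.io.Serializable.
    val pairs = sc.sequenceFile(inputPath, classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }

    // Sort by key, then write back out as a sequence file.
    // The String keys and values are converted to Text on write.
    pairs.sortByKey().saveAsSequenceFile(outputPath)

    sc.stop()
  }
}
```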
My issue is that this solution does not scale. I can run this code on a 32GB file without issues. When I run it on a 500GB file, it runs some of the data nodes out of physical disk space. It spills like crazy, usually 2-3 times the amount of original data, so my 32GB file spills 74GB. I suspect there is a better way to get the data into a form that sort will accept. Is there a better way to do it than mapping the key and value to Strings?

Thanks,
Joe