All,

I am having trouble sorting a sequence file.  My sequence file is
(Text, Text), and when I try to sort it, Spark complains that it cannot
because Text is not serializable.  To get around this issue, I map the
sequence file to (String, String), perform the sort, and then write the
result back out to HDFS as a sequence file.
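
For reference, the pipeline described above can be sketched roughly as
follows.  The original code is not shown in this message, so this is only a
PySpark approximation: the function names and both paths are hypothetical,
and `sc` is assumed to be an existing SparkContext.  The explicit map
mirrors the (String, String) conversion step.

```python
def to_string_pair(record):
    """Convert a (key, value) record to a plain (str, str) pair."""
    key, value = record
    return (str(key), str(value))


def sort_sequence_file(sc, input_path, output_path):
    """Read a sequence file, sort it by key, and write it back out.

    Assumption: `sc` is a pyspark.SparkContext and both paths are
    placeholders for real HDFS locations.
    """
    pairs = sc.sequenceFile(input_path)      # RDD of (key, value) records
    strings = pairs.map(to_string_pair)      # the (String, String) map step
    ordered = strings.sortByKey()            # sorts across partitions (shuffle)
    ordered.saveAsSequenceFile(output_path)  # write the sorted pairs back out
```

A call would look like sort_sequence_file(sc, "hdfs:///in", "hdfs:///out"),
with both paths replaced by real locations.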

My issue is that this solution does not scale.  I can run this code on a
32 GB file without issues.  When I run it on a 500 GB file, it runs some of
the data nodes out of physical disk space.  It spills heavily, usually 2-3
times the size of the original data: my 32 GB file spills 74 GB.
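
To put numbers on that, here is a rough back-of-the-envelope projection.  It
assumes the spill grows linearly with input size, which is only a guess
based on the single 32 GB measurement above; the function names are made up
for illustration.

```python
def spill_ratio(input_gb, spilled_gb):
    """Spilled bytes as a multiple of the input size."""
    return spilled_gb / input_gb


def projected_spill_gb(input_gb, ratio):
    """Projected spill for a given input, assuming linear scaling."""
    return input_gb * ratio


ratio = spill_ratio(32, 74)                  # observed: 74 GB spilled for 32 GB
projected = projected_spill_gb(500, ratio)   # extrapolated to the 500 GB file
print(f"{ratio:.2f}x spill, about {projected:.0f} GB for a 500 GB input")
```

Under that assumption the 500 GB job would need over 1 TB of spill space
across the cluster, on top of the input and output data.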

I suspect there is a better way to get the data into a form that sort will
accept.  Is there an alternative to mapping the key and value to Strings?

Thanks,

Joe


