Hi Experts,

I am confused about the input parameters of GenSort.scala and have run into some strange results. It requires three parameters: "[num-parts] [records-per-part] [output-path]". As with Hadoop, I assume that each row (record) of the file to be sorted is 100 bytes. So if I want to generate and sort 100 GB of data using 4 partitions, is it correct to set the parameters to '4, 268435456, /tmp/sort-output'? I computed the number of records (rows) as follows:
100 GB of data = 107374182400 bytes = 1073741824 rows * 100 bytes/row = 268435456 rows * 4 partitions * 100 bytes/row

So each partition should contain 268435456 rows (records), right?

However, if I save the output as a sequence file, the total size of the output files is only 20.8 GB (5.2 GB * 4 partitions). If I save the output as a text file instead of a sequence file, the total size is 309.2 GB (77.3 GB * 4 partitions), not 100 GB. Why?

Thanks!

--------------------------------
Sam Liu
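
P.S. In case it helps to check my arithmetic, here is the same calculation as a small Python sketch. It rests on two assumptions of mine: that each record is 100 bytes, and that 1 GB means 2^30 bytes. The function names are made up for illustration and are not part of GenSort.scala.

```python
# Assumption: every record is 100 bytes, as in the Hadoop sort benchmark.
RECORD_BYTES = 100
# Assumption: 1 GB is counted as 2**30 bytes.
GB = 2 ** 30


def records_per_part(total_bytes, num_parts, record_bytes=RECORD_BYTES):
    """Return how many records each partition needs to reach total_bytes."""
    total_records, leftover = divmod(total_bytes, record_bytes)
    if leftover:
        raise ValueError("total_bytes is not a multiple of the record size")
    per_part, leftover = divmod(total_records, num_parts)
    if leftover:
        raise ValueError("records do not divide evenly across partitions")
    return per_part


def observed_bytes_per_record(output_gb, num_parts, per_part):
    """Average on-disk bytes per record, given the total output size in GB."""
    return output_gb * GB / (num_parts * per_part)


total_bytes = 100 * GB
per_part = records_per_part(total_bytes, 4)
print(total_bytes, per_part)

# Output sizes I observed, converted to average bytes per record.
print(observed_bytes_per_record(20.8, 4, per_part))   # sequence file
print(observed_bytes_per_record(309.2, 4, per_part))  # text file
```

If those assumptions hold, the sequence file works out to about 20.8 bytes per record and the text file to about 309.2 bytes per record, while I expected 100 bytes in both cases.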