This poor soul had the exact same problem and solution: http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
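
In case the link goes away, the gist as I understand it: `BytesWritable.getBytes` returns the whole backing buffer, which can be longer than the record, and only the first `getLength` bytes are valid. Below is a minimal sketch of trimming to the valid range, assuming hadoop-common is on the classpath. The object and the `validBytes` helper are hypothetical names made up for illustration, not part of Spark or Hadoop.

```scala
import java.util.Arrays
import org.apache.hadoop.io.BytesWritable

// Hypothetical helper object, for illustration only.
object BytesWritableTrim {

  // Copy only the valid bytes of the record. getBytes exposes the backing
  // buffer, whose length (the capacity) may exceed getLength.
  def validBytes(w: BytesWritable): Array[Byte] =
    Arrays.copyOfRange(w.getBytes, 0, w.getLength)

  def main(args: Array[String]): Unit = {
    val w = new BytesWritable()
    w.set(Array[Byte](1, 2, 3), 0, 3)

    // The backing buffer may be padded beyond the record length.
    println(s"backing buffer length: ${w.getBytes.length}")
    println(s"valid length:          ${w.getLength}")
    println(s"trimmed length:        ${validBytes(w).length}")
  }
}
```

In the quoted job, that would presumably mean encoding `Arrays.copyOfRange(x.getBytes, 0, x.getLength)` instead of `x.getBytes` in the final `map`. Newer Hadoop versions also provide `BytesWritable.copyBytes()`, which should do the same thing, but I have not checked that it is available in the Hadoop version bundled here.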

On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji <eshi...@gmail.com> wrote:

> Hi, I'm facing a weird issue. Any help appreciated.
>
> When I execute the below code and compare "input" and "output", each
> record in the output has some extra trailing data appended to it, and hence
> corrupted. I'm just reading and writing, so the input and output should be
> exactly the same.
>
> I'm using spark-core 1.2.0_2.10 and the Hadoop bundled in it
> (hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed the binary is
> fine at the time it's passed to Hadoop classes, and has already the extra
> data when in Hadoop classes (I guess this makes it more of a Hadoop
> question...).
>
> Code:
> =====
> def main(args: Array[String]) {
>   val conf = new SparkConf()
>     .setMaster("local[4]")
>     .setAppName("Simple Application")
>
>   val sc = new SparkContext(conf)
>
>   // input.txt is a text file with some Base64 encoded binaries stored as lines
>
>   val src = sc
>     .textFile("input.txt")
>     .map(DatatypeConverter.parseBase64Binary)
>     .map(x => (NullWritable.get(), new BytesWritable(x)))
>     .saveAsSequenceFile("s3n://fake-test/stored")
>
>   val file = "s3n://fake-test/stored"
>   val logData = sc.sequenceFile(file, classOf[NullWritable],
>     classOf[BytesWritable])
>
>   val count = logData
>     .map { case (k, v) => v }
>     .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
>     .saveAsTextFile("/tmp/output")
>
> }