Hi, I'm facing a weird issue. Any help appreciated. When I run the code below and compare "input" and "output", each record in the output has some extra trailing data appended to it, so the records are corrupted. I'm only reading and writing, so the input and output should be exactly the same.
I'm using spark-core 1.2.0_2.10 and the Hadoop bundled with it (hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed that the binary is fine at the point where it is passed to the Hadoop classes, and that it already contains the extra data once it is inside the Hadoop classes (so I guess this is more of a Hadoop question...).

Code:
=====
import javax.xml.bind.DatatypeConverter
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

def main(args: Array[String]) {
  val conf = new SparkConf()
    .setMaster("local[4]")
    .setAppName("Simple Application")
  val sc = new SparkContext(conf)

  // input.txt is a text file with some Base64 encoded binaries stored as lines
  val src = sc
    .textFile("input.txt")
    .map(DatatypeConverter.parseBase64Binary)
    .map(x => (NullWritable.get(), new BytesWritable(x)))
    .saveAsSequenceFile("s3n://fake-test/stored")

  // Read the sequence file back and write each record out as Base64 text
  val file = "s3n://fake-test/stored"
  val logData = sc.sequenceFile(file, classOf[NullWritable], classOf[BytesWritable])
  val count = logData
    .map { case (k, v) => v }
    .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
    .saveAsTextFile("/tmp/output")
}
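
One thing that may be worth checking (a guess, not something confirmed on this setup): `BytesWritable.getBytes` returns the backing buffer, and only the first `getLength` bytes of it are valid, so encoding the whole array could pick up padding. The sketch below does not use Hadoop at all; it simulates a padded buffer with plain arrays, and `validBytes` is a hypothetical helper name introduced only for illustration.

```scala
import java.util.Arrays

object TrimBackingBuffer {
  // Hypothetical helper: keep only the first `length` bytes of a backing buffer.
  // For a Hadoop BytesWritable `w`, the assumed call would be
  // validBytes(w.getBytes, w.getLength).
  def validBytes(buffer: Array[Byte], length: Int): Array[Byte] =
    Arrays.copyOfRange(buffer, 0, length)

  def main(args: Array[String]): Unit = {
    // Simulate a record of 4 valid bytes held in a larger, zero-padded buffer.
    val record = Array[Byte](1, 2, 3, 4)
    val padded = Arrays.copyOf(record, 6)

    println(padded.length)                                          // prints 6
    println(validBytes(padded, record.length).sameElements(record)) // prints true
  }
}
```

If this is indeed the cause, the change in the code above would be to Base64-encode `Arrays.copyOfRange(x.getBytes, 0, x.getLength)` instead of `x.getBytes`.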