This poor soul had the exact same problem and solution:
http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
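For reference, the gist of the fix in that thread: BytesWritable.getBytes returns the whole backing buffer, which Hadoop pads and reuses, so it is usually longer than the actual record; you have to truncate it to getLength yourself. A minimal sketch of the corrected read side (same paths and classes as in the code quoted below; on Hadoop 2.x, v.copyBytes() should do the same trimming in one call):

import java.util.Arrays
import javax.xml.bind.DatatypeConverter
import org.apache.hadoop.io.{BytesWritable, NullWritable}

val logData = sc.sequenceFile("s3n://fake-test/stored",
  classOf[NullWritable], classOf[BytesWritable])
logData
  // copy only the first getLength bytes, dropping the buffer padding
  .map { case (_, v) => Arrays.copyOfRange(v.getBytes, 0, v.getLength) }
  .map(DatatypeConverter.printBase64Binary)
  .saveAsTextFile("/tmp/output")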
On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji eshi...@gmail.com wrote:
Hi, I'm facing a weird issue. Any help appreciated.
When I execute the below code and compare input and output, each
record in the output has some extra trailing data appended to it and is
hence corrupted. I'm just reading and writing, so the input and output should
be exactly the same.
I'm using spark-core 1.2.0_2.10 and the Hadoop bundled in it
(hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed the binary is
fine at the time it's passed to the Hadoop classes, and already has the
extra data once inside the Hadoop classes (I guess this makes it more of a
Hadoop question...).
Code:
=
import javax.xml.bind.DatatypeConverter
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

def main(args: Array[String]) {
  val conf = new SparkConf()
    .setMaster("local[4]")
    .setAppName("Simple Application")
  val sc = new SparkContext(conf)

  // input.txt is a text file with some Base64 encoded binaries stored as lines
  val src = sc
    .textFile("input.txt")
    .map(DatatypeConverter.parseBase64Binary)
    .map(x => (NullWritable.get(), new BytesWritable(x)))
    .saveAsSequenceFile("s3n://fake-test/stored")

  val file = "s3n://fake-test/stored"
  val logData = sc.sequenceFile(file, classOf[NullWritable],
    classOf[BytesWritable])
  val count = logData
    .map { case (k, v) => v }
    .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
    .saveAsTextFile("/tmp/output")
}