Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
Hi, I'm facing a weird issue. Any help appreciated.

When I execute the code below and compare input and output, each record
in the output has some extra trailing data appended to it and is hence
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.

I'm using spark-core 1.2.0_2.10 and the Hadoop bundled with it
(hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed the binary is
fine at the time it's passed to the Hadoop classes, and that it already has
the extra data once inside the Hadoop classes (I guess this makes it more
of a Hadoop question...).

Code:
=====
  import javax.xml.bind.DatatypeConverter

  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Simple Application")

    val sc = new SparkContext(conf)

    // input.txt is a text file with some Base64 encoded binaries stored as lines
    val src = sc
      .textFile("input.txt")
      .map(DatatypeConverter.parseBase64Binary)
      .map(x => (NullWritable.get(), new BytesWritable(x)))
      .saveAsSequenceFile("s3n://fake-test/stored")

    val file = "s3n://fake-test/stored"
    val logData = sc.sequenceFile(file, classOf[NullWritable], classOf[BytesWritable])

    val count = logData
      .map { case (k, v) => v }
      .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
      .saveAsTextFile("/tmp/output")
  }



[SOLVED] Re: Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
This poor soul had the exact same problem and solution:

http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
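
For the record, the root cause described there: BytesWritable.getBytes
returns the whole backing buffer, which Hadoop reuses and pads beyond the
record's actual length, so everything past getLength is stale padding. A
minimal sketch of the read-side fix, assuming the logData RDD from the code
above (Arrays.copyOfRange is just one way to slice; any equivalent copy works):

  // getBytes hands back the padded backing buffer; getLength is the
  // record's true size, so copy only that many bytes before encoding.
  val count = logData
    .map { case (k, v) => java.util.Arrays.copyOfRange(v.getBytes, 0, v.getLength) }
    .map(DatatypeConverter.printBase64Binary)
    .saveAsTextFile("/tmp/output")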
