Per your last comment, it appears I need something like this: https://github.com/RIPE-NCC/hadoop-pcap
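
For the Python side, here is a minimal sketch of what "knowing how to parse pcap records" could look like. It is not the hadoop-pcap implementation. It assumes the classic libpcap file layout (a 24-byte global header, then a 16-byte header in front of each packet, microsecond timestamps) and ignores pcapng and the nanosecond variant. The names `parse_pcap` and `to_csv_line` are made up for illustration.

```python
import struct

# Magic number of the classic libpcap format (microsecond timestamps),
# as seen when the first four bytes are read as a little-endian integer.
MAGIC_NATIVE = 0xA1B2C3D4   # file was written little-endian
MAGIC_SWAPPED = 0xD4C3B2A1  # file was written big-endian


def parse_pcap(data):
    """Yield (ts_sec, ts_usec, orig_len, payload) for each record in a pcap byte string.

    This is a generator, so format errors surface on first iteration.
    """
    if len(data) < 24:
        raise ValueError("too short to contain a pcap global header")
    (magic,) = struct.unpack("<I", data[:4])
    if magic == MAGIC_NATIVE:
        endian = "<"
    elif magic == MAGIC_SWAPPED:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file (unexpected magic number)")

    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-record header: seconds, microseconds, captured length, original length.
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        offset += 16
        payload = data[offset:offset + incl_len]
        if len(payload) < incl_len:
            break  # truncated final record; stop rather than emit partial data
        offset += incl_len
        yield ts_sec, ts_usec, orig_len, payload


def to_csv_line(record):
    """Format a record as 'seconds.microseconds,original_length,captured_length'."""
    ts_sec, ts_usec, orig_len, payload = record
    return "%d.%06d,%d,%d" % (ts_sec, ts_usec, orig_len, len(payload))
```

On a PySpark version that provides `SparkContext.binaryFiles` (an assumption about the installed release), something like `sc.binaryFiles("binary-data.dat").flatMap(lambda kv: parse_pcap(kv[1])).map(to_csv_line)` would avoid `textFile` entirely. Each file is read whole by a single task, so this only suits captures that fit in a worker's memory.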

Thanks a ton. That gets me oriented in the right direction.

On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen <so...@cloudera.com> wrote:

> Well it looks like you're reading some kind of binary file as text.
> That isn't going to work, in Spark or elsewhere, as binary data is not
> even necessarily the valid encoding of a string. There are no line
> breaks to delimit lines and thus elements of the RDD.
>
> Your input has some record structure (or else it's not really useful
> to put it into an RDD). You can encode this as a SequenceFile and read
> it with objectFile.
>
> You could also write a custom InputFormat that knows how to parse pcap
> records directly.
>
> On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen <n...@nickallen.org> wrote:
> > I have an RDD containing binary data. I would like to use 'RDD.pipe' to
> > pipe that binary data to an external program that will translate it to
> > string/text data. Unfortunately, it seems that Spark is mangling the
> > binary data before it gets passed to the external program.
> >
> > This code is representative of what I am trying to do. What am I doing
> > wrong? How can I pipe binary data in Spark? Maybe it is getting
> > corrupted when I read it in initially with 'textFile'?
> >
> > bin = sc.textFile("binary-data.dat")
> > csv = bin.pipe("/usr/bin/binary-to-csv.sh")
> > csv.saveAsTextFile("text-data.csv")
> >
> > Specifically, I am trying to use Spark to transform pcap (packet capture)
> > data to text/csv so that I can perform an analysis on it.
> >
> > Thanks!
> >
> > --
> > Nick Allen <n...@nickallen.org>

--
Nick Allen <n...@nickallen.org>