Re: How to 'Pipe' Binary Data in Apache Spark

Nick Allen Fri, 16 Jan 2015 07:41:28 -0800

Per your last comment, it appears I need something like this:

https://github.com/RIPE-NCC/hadoop-pcap



Thanks a ton.  That gets me oriented in the right direction.
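
For anyone finding this thread later, here is a minimal sketch of the "record structure" Sean describes below, assuming the classic libpcap file layout (a 24-byte global header followed by 16-byte per-packet headers). The names iter_pcap_records and record_to_csv are my own, not part of Spark or hadoop-pcap, and the nanosecond-resolution and pcapng variants are not handled.

```python
import struct

GLOBAL_HEADER_SIZE = 24  # magic, version, thiszone, sigfigs, snaplen, network


def iter_pcap_records(data):
    """Yield (ts_sec, ts_usec, packet_bytes) from the bytes of a classic pcap file."""
    if len(data) < GLOBAL_HEADER_SIZE:
        raise ValueError("too short to be a pcap file")
    # The magic number tells us the byte order the file was written in.
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic == 0xA1B2C3D4:
        endian = "<"
    elif magic == 0xD4C3B2A1:
        endian = ">"
    else:
        raise ValueError("not a classic (microsecond) pcap file")
    # Per-packet header: ts_sec, ts_usec, incl_len, orig_len.
    record_header = struct.Struct(endian + "IIII")
    offset = GLOBAL_HEADER_SIZE
    while offset + record_header.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = record_header.unpack_from(data, offset)
        offset += record_header.size
        if offset + incl_len > len(data):
            break  # truncated final record; stop rather than emit partial data
        yield ts_sec, ts_usec, data[offset:offset + incl_len]
        offset += incl_len


def record_to_csv(ts_sec, ts_usec, packet):
    """Render one record as 'seconds.microseconds,length,hex-payload' (hypothetical format)."""
    return "%d.%06d,%d,%s" % (ts_sec, ts_usec, len(packet), packet.hex())
```

Record boundaries come from the length field in each header, not from line breaks, which is why reading the file with 'textFile' cannot work. If I understand the API correctly, later PySpark releases offer sc.binaryFiles, which returns (path, bytes) pairs, so something like sc.binaryFiles(path).flatMap(lambda kv: iter_pcap_records(kv[1])) should apply this per file, as long as each capture fits in memory on one executor.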

On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen <so...@cloudera.com> wrote:

> Well it looks like you're reading some kind of binary file as text.
> That isn't going to work, in Spark or elsewhere, as binary data is not
> even necessarily the valid encoding of a string. There are no line
> breaks to delimit lines and thus elements of the RDD.
>
> Your input has some record structure (or else it's not really useful
> to put it into an RDD). You can encode this as a SequenceFile and read
> it with objectFile.
>
> You could also write a custom InputFormat that knows how to parse pcap
> records directly.
>
> On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen <n...@nickallen.org> wrote:
> > I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe
> > that binary data to an external program that will translate it to
> > string/text data. Unfortunately, it seems that Spark is mangling the binary
> > data before it gets passed to the external program.
> >
> > This code is representative of what I am trying to do. What am I doing
> > wrong? How can I pipe binary data in Spark?  Maybe it is getting corrupted
> > when I read it in initially with 'textFile'?
> >
> > bin = sc.textFile("binary-data.dat")
> > csv = bin.pipe("/usr/bin/binary-to-csv.sh")
> > csv.saveAsTextFile("text-data.csv")
> >
> > Specifically, I am trying to use Spark to transform pcap (packet capture)
> > data to text/csv so that I can perform an analysis on it.
> >
> > Thanks!
> >
> > --
> > Nick Allen <n...@nickallen.org>
>



-- 
Nick Allen <n...@nickallen.org>
