--
From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org
The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.
INT
INT
Hm, that will indeed be trickier because this method assumes records are the
same byte size. Is the file an arbitrary sequence of mixed types, or is there
structure, e.g. short, long, short, long, etc.?
If you could post a gist with an example of the kind of file and how it should
look once
I think implementing your own InputFormat and using SparkContext.hadoopFile()
is the best option for your case.
Yong
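A custom InputFormat is JVM code, so it cannot be shown in Python. As a rough analogue only (a different technique from the one suggested above), the sketch below splits one fixed-width section of the file into byte ranges and lets each task seek to and read just its own range, so the driver never holds the data. It assumes the file is visible at the same path on every executor (for example on a shared filesystem) and that values are little-endian. The names plan_chunks, read_chunk and section_rdd are hypothetical helpers, not Spark APIs.

```python
import struct

def plan_chunks(section_offset, count, item_size, max_items):
    """Split one fixed-width section into (byte_offset, n_items) ranges."""
    chunks = []
    for start in range(0, count, max_items):
        n = min(max_items, count - start)
        chunks.append((section_offset + start * item_size, n))
    return chunks

def read_chunk(path, byte_offset, n_items, code):
    """Read n_items values of struct type `code` starting at byte_offset."""
    fmt = f"<{n_items}{code}"  # little-endian is an assumption
    with open(path, "rb") as f:
        f.seek(byte_offset)
        return list(struct.unpack(fmt, f.read(struct.calcsize(fmt))))

def section_rdd(sc, path, section_offset, count, code, max_items=1000000):
    """Build an RDD for one section; each task reads only its own byte range.

    Assumes `path` is readable at the same location on every executor.
    """
    item_size = struct.calcsize("<" + code)
    chunks = plan_chunks(section_offset, count, item_size, max_items)
    return sc.parallelize(chunks, max(1, len(chunks))).flatMap(
        lambda c: read_chunk(path, c[0], c[1], code))
```

Only the small list of (offset, count) pairs is created at the driver; the values themselves are read inside the tasks.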
Thanks for the reply. Unfortunately, in my case, the binary file is a mix
of short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual
computation time). Also, I am short of memory at the driver when it has to
If it’s a flat binary file and each record is the same length (in bytes), you
can use Spark’s binaryRecords method (defined on the SparkContext), which loads
records from one or more large flat binary files into an RDD. Here’s an example
in Python to show how it works:
# write data from an
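Separately from the example in this message, here is a minimal sketch of the fixed-length case it describes. The record layout (one 4-byte int followed by one 8-byte long, little-endian) and the helper names parse_record and load_fixed_records are assumptions made up for illustration; the only Spark call used is SparkContext.binaryRecords(path, recordLength).

```python
import struct

# Hypothetical record layout: a 4-byte int then an 8-byte long, little-endian.
RECORD = struct.Struct("<iq")

def parse_record(raw):
    """Decode one fixed-length record (a bytes object) into a tuple."""
    return RECORD.unpack(raw)

def load_fixed_records(sc, path):
    """Return an RDD of decoded tuples; sc is an existing SparkContext."""
    # binaryRecords cuts the file into chunks of exactly RECORD.size bytes.
    return sc.binaryRecords(path, RECORD.size).map(parse_record)
```

This only works when every record has the same byte length, which is why a file mixing sections of different element widths does not fit it directly.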
The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.
INT
INT(A)
LONG (B)
A INTs(Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs
A - number of vertices
B - number of edges
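To make the layout above concrete, here is a small sketch that computes where each section starts and decodes a file held in memory. It assumes INT is 4 bytes, LONG is 8 bytes, SHORTINT is 2 bytes, and little-endian byte order; none of these widths is stated in the post. The meaning of the first INT is not given, so it is kept as "first", and the names for the four B-length arrays are placeholders.

```python
import struct

# Assumed widths: INT = 4 bytes, LONG = 8 bytes, SHORTINT = 2 bytes, little-endian.
HEADER = struct.Struct("<iiq")  # first INT (meaning not stated), A, B

def section_layout(a, b):
    """Return (name, byte_offset, struct_code, count) for each array after the header."""
    sections = [
        ("degrees", "i", a),
        ("vertex_attribute", "h", a),
        ("edge_ints_1", "i", b),    # placeholder name
        ("edge_ints_2", "i", b),    # placeholder name
        ("edge_shorts_1", "h", b),  # placeholder name
        ("edge_shorts_2", "h", b),  # placeholder name
    ]
    layout, offset = [], HEADER.size
    for name, code, count in sections:
        layout.append((name, offset, code, count))
        offset += count * struct.calcsize("<" + code)
    return layout

def read_graph(data):
    """Decode the whole structure from a bytes object (small files only)."""
    first, a, b = HEADER.unpack_from(data, 0)
    graph = {"first": first, "A": a, "B": b}
    for name, offset, code, count in section_layout(a, b):
        graph[name] = list(struct.unpack_from(f"<{count}{code}", data, offset))
    return graph
```

read_graph is only meant for checking the layout on a small sample. For the large file, the offsets from section_layout are the useful part, since each section is a run of same-width values that can be read independently.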