Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
-- From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a specific structure. I outline it below. The input file is basically a representation of a graph. INT INT

Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a specific structure. I outline it below. The input file is basically a representation of a graph. INT INT (A) LONG (B) A INTs

RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a specific structure. I outline it below. The input file is basically a representation of a graph. INT INT (A) LONG (B) A INTs (Degrees) A SHORTINTs

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
Hm, that will indeed be trickier because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once

RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method has a large overhead (much more than the actual computation time). Also, I am short of memory at the driver when it has to

Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
If it’s a flat binary file and each record is the same length (in bytes), you can use Spark’s binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here’s an example in python to show how it works: # write data from an
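The python example in that message is cut off in this archive. As a rough sketch of the idea, the snippet below illustrates in plain Python what binaryRecords does per record: split a flat binary file into fixed-length byte chunks, which you then parse yourself. The 8-byte record size and the two-int field layout here are assumptions chosen for the demo, not something stated in the thread.

```python
import struct

# Suppose each record is two 32-bit little-endian ints (8 bytes total),
# mirroring the fixed-record-size assumption behind
# sc.binaryRecords(path, recordLength).
RECORD_LEN = 8

# Write a small flat binary "file" of three records.
records = [(1, 10), (2, 20), (3, 30)]
data = b"".join(struct.pack("<ii", a, b) for a, b in records)

# binaryRecords would hand back raw byte strings of length RECORD_LEN;
# parsing them is then a simple map over struct.unpack.
chunks = [data[i:i + RECORD_LEN] for i in range(0, len(data), RECORD_LEN)]
parsed = [struct.unpack("<ii", c) for c in chunks]
print(parsed)  # [(1, 10), (2, 20), (3, 30)]
```

With Spark, the equivalent distributed version would presumably be something like sc.binaryRecords("path/to/file", 8).map(lambda c: struct.unpack("<ii", c)), with each executor parsing its own records.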

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
The file has a specific structure. I outline it below. The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges
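For reference, a minimal pure-Python parser for a layout like the one described above might look like the sketch below. The byte order, the 4-byte int / 2-byte short / 8-byte long widths, and the meaning of the leading INT are all assumptions made for illustration; the actual file may differ.

```python
import struct

def parse_graph(data: bytes):
    """Parse the layout described above: a header (INT, INT A, LONG B),
    then per-vertex arrays of size A and per-edge arrays of size B."""
    off = 0

    def take(fmt):
        nonlocal off
        vals = struct.unpack_from(fmt, data, off)
        off += struct.calcsize(fmt)
        return vals

    (magic,) = take("<i")      # leading INT (meaning not given in the thread)
    (a,) = take("<i")          # A: number of vertices
    (b,) = take("<q")          # B: number of edges
    degrees = take(f"<{a}i")   # A INTs (Degrees)
    vattrs = take(f"<{a}h")    # A SHORTINTs (Vertex_Attribute)
    src = take(f"<{b}i")       # B INTs
    dst = take(f"<{b}i")       # B INTs
    eattr1 = take(f"<{b}h")    # B SHORTINTs
    eattr2 = take(f"<{b}h")    # B SHORTINTs
    return magic, a, b, degrees, vattrs, src, dst, eattr1, eattr2

# Build a tiny file with A=2 vertices and B=1 edge, then round-trip it.
blob = struct.pack("<iiq", 7, 2, 1)
blob += struct.pack("<2i", 1, 1)   # degrees
blob += struct.pack("<2h", 5, 6)   # vertex attributes
blob += struct.pack("<1i", 0)      # edge array 1
blob += struct.pack("<1i", 1)      # edge array 2
blob += struct.pack("<1h", 9)      # edge shorts 1
blob += struct.pack("<1h", 9)      # edge shorts 2
print(parse_graph(blob))
```

Because the record widths vary (ints vs. shorts) and the array lengths depend on the header, this format has no fixed record length, which is why binaryRecords does not apply directly and a custom InputFormat (or a single sequential parse like the one above, followed by sc.parallelize on the resulting arrays) was suggested in the thread.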