Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.

https://github.com/scodec/scodec
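
A rough sketch of what such a parser could look like, assuming scodec 1.x syntax
and the header layout described further down in this thread (field names are
illustrative, and big-endian integers are assumed; int32L/int64L would be the
little-endian variants):

import scodec.bits.BitVector
import scodec.codecs._

// Hypothetical header: a leading INT, then A (vertex count) as an INT and
// B (edge count) as a LONG.
case class Header(lead: Int, numVertices: Int, numEdges: Long)

val headerCodec = (int32 :: int32 :: int64).as[Header]

// The variable-length sections become count-dependent vectors, e.g. the A
// degree INTs and the A vertex-attribute SHORTINTs.
def degreesCodec(a: Int) = vectorOfN(provide(a), int32)
def vertexAttrsCodec(a: Int) = vectorOfN(provide(a), int16)

def decodeHeader(bytes: Array[Byte]) = headerCodec.decode(BitVector(bytes))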

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:

 I think implementing your own InputFormat and using
 SparkContext.hadoopFile() is the best option for your case.

 Yong

 --
 From: kvi...@vt.edu
 Date: Thu, 2 Apr 2015 17:31:30 -0400
 Subject: Re: Reading a large file (binary) into RDD
 To: freeman.jer...@gmail.com
 CC: user@spark.apache.org


 The file has a specific structure. I outline it below.

 The input file is basically a representation of a graph.

 INT
 INT(A)
 LONG (B)
 A INTs(Degrees)
 A SHORTINTs  (Vertex_Attribute)
 B INTs
 B INTs
 B SHORTINTs
 B SHORTINTs

 A - number of vertices
 B - number of edges (note that the INTs/SHORTINTs associated with this are
 edge attributes)

 After reading in the file, I need to create two RDDs (one with vertices
 and the other with edges)
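
 One way to use that layout without a custom InputFormat is to parse each whole
 file on an executor via sc.binaryFiles. A sketch in Scala (my own guesses for
 what the two B-INT sections contain, big-endian integers assumed, and assuming
 a single file fits in one executor's memory):

 import java.io.DataInputStream
 import org.apache.spark.SparkContext

 def loadGraph(sc: SparkContext, path: String) = {
   val records = sc.binaryFiles(path).flatMap { case (_, stream) =>
     val in: DataInputStream = stream.open()
     try {
       val lead = in.readInt()                    // leading INT (meaning not stated above)
       val a    = in.readInt()                    // A: number of vertices
       val b    = in.readLong().toInt             // B: number of edges (assumed to fit in an Int)
       val degrees = Array.fill(a)(in.readInt())
       val vAttrs  = Array.fill(a)(in.readShort())
       val srcs    = Array.fill(b)(in.readInt())  // guess: edge source ids
       val dsts    = Array.fill(b)(in.readInt())  // guess: edge destination ids
       val eAttr1  = Array.fill(b)(in.readShort())
       val eAttr2  = Array.fill(b)(in.readShort())
       val vertices = (0 until a).map(i => Left((i, degrees(i), vAttrs(i))))
       val edges    = (0 until b).map(i => Right((srcs(i), dsts(i), eAttr1(i), eAttr2(i))))
       vertices ++ edges
     } finally in.close()
   }
   // consider records.cache() here to avoid parsing the files twice
   (records.collect { case Left(v) => v },   // vertex RDD
    records.collect { case Right(e) => e })  // edge RDD
 }

 This keeps the parsing off the driver, but a single very large file is still read
 by one task; splitting one file across tasks would need header-driven offsets and
 a custom InputFormat, as discussed elsewhere in the thread.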

 On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could be of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.








Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
Thanks everyone for the inputs.

I guess I will try out a custom implementation of InputFormat. But I have
no idea where to start. Are there any code examples of this that might help?

On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:

 This might be overkill for your needs, but the scodec parser combinator
 library might be useful for creating a parser.

 https://github.com/scodec/scodec

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:

 I think implementing your own InputFormat and using
 SparkContext.hadoopFile() is the best option for your case.

 Yong

 --
 From: kvi...@vt.edu
 Date: Thu, 2 Apr 2015 17:31:30 -0400
 Subject: Re: Reading a large file (binary) into RDD
 To: freeman.jer...@gmail.com
 CC: user@spark.apache.org


 The file has a specific structure. I outline it below.

 The input file is basically a representation of a graph.

 INT
 INT(A)
 LONG (B)
 A INTs(Degrees)
 A SHORTINTs  (Vertex_Attribute)
 B INTs
 B INTs
 B SHORTINTs
 B SHORTINTs

 A - number of vertices
 B - number of edges (note that the INTs/SHORTINTs associated with this
 are edge attributes)

 After reading in the file, I need to create two RDDs (one with vertices
 and the other with edges)

 On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could be of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.









RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Hadoop's TextInputFormat is a good starting point.
It is not really that hard. You just need to implement the logic that identifies
the record delimiter, and think of a sensible way to represent the key and value for
your RecordReader (see the sketch below).
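
As a rough, untested starting point, here is a whole-file variant of this idea in
Scala (hypothetical class names, new org.apache.hadoop.mapreduce API): the format
marks files as unsplittable and the reader emits each file's bytes as a single
record, which can then be parsed with whatever logic matches the layout.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IOUtils, NullWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

class WholeBinaryFileInputFormat extends FileInputFormat[NullWritable, BytesWritable] {
  override def isSplitable(context: JobContext, file: Path): Boolean = false
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext) =
    new WholeBinaryFileRecordReader
}

class WholeBinaryFileRecordReader extends RecordReader[NullWritable, BytesWritable] {
  private var split: FileSplit = _
  private var conf: Configuration = _
  private var processed = false
  private val value = new BytesWritable()

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    split = inputSplit.asInstanceOf[FileSplit]
    conf = context.getConfiguration
  }

  // One record per file: the file's entire contents.
  override def nextKeyValue(): Boolean = {
    if (processed) return false
    val path = split.getPath
    val in = path.getFileSystem(conf).open(path)
    val bytes = new Array[Byte](split.getLength.toInt)
    try IOUtils.readFully(in, bytes, 0, bytes.length) finally in.close()
    value.set(bytes, 0, bytes.length)
    processed = true
    true
  }

  override def getCurrentKey: NullWritable = NullWritable.get()
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}

Note that this still reads each file in one task; truly splitting a single huge
file would mean computing custom splits (for example at boundaries derived from
the header counts) in getSplits().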
Yong

From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading a large file (binary) into RDD
To: deanwamp...@gmail.com
CC: java8...@hotmail.com; user@spark.apache.org

Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have no 
idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:
This might be overkill for your needs, but the scodec parser combinator library 
might be useful for creating a parser.
https://github.com/scodec/scodec
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:



I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.





  



  

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 

If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!

-
jeremyfreeman.net
@thefreemanlab

On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
 short and long integers. Is there any other way that could be of use here?
 
 My current method happens to have a large overhead (much more than actual 
 computation time). Also, I am short of memory at the driver when it has to 
 read the entire file.
 
 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com 
 wrote:
 If it’s a flat binary file and each record is the same length (in bytes), you 
 can use Spark’s binaryRecords method (defined on the SparkContext), which 
 loads records from one or more large flat binary files into an RDD. Here’s an 
 example in python to show how it works:
 
 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()
 
 # load the data back in
 from numpy import frombuffer
 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
 
 # these should be equal
 parsed.first()
 dat[0,:]
 
 
 Does that help?
 
 -
 jeremyfreeman.net
 @thefreemanlab
 
 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
 
 What are some efficient ways to read a large file into RDDs?
 
 For example, have several executors read a specific/unique portion of the 
 file and construct RDDs. Is this possible to do in Spark?
 
 Currently, I am doing a line-by-line read of the file at the driver and 
 constructing the RDD.
 
 



RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
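
For illustration, wiring a custom format into Spark might look roughly like this
(Scala; WholeBinaryFileInputFormat is the hypothetical whole-file format sketched
elsewhere in this thread, and newAPIHadoopFile is used because that sketch targets
the org.apache.hadoop.mapreduce API; plain hadoopFile works the same way for
formats written against the older mapred API):

import org.apache.hadoop.io.{BytesWritable, NullWritable}

val raw = sc.newAPIHadoopFile[NullWritable, BytesWritable, WholeBinaryFileInputFormat](
  "hdfs:///path/to/graph.bin")   // placeholder path

// Hadoop reuses Writable instances, so copy the bytes out before caching or
// collecting them.
val fileBytes = raw.map { case (_, v) => v.copyBytes() }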
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.





  

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
Thanks for the reply. Unfortunately, in my case, the binary file is a mix
of short and long integers. Is there any other way that could be of use here?

My current method happens to have a large overhead (much more than actual
computation time). Also, I am short of memory at the driver when it has to
read the entire file.

On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.





Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:

 # write data from an array (100 records, each holding 5 float64 values)
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'wb')  # binary mode; writes dat's raw bytes
 f.write(dat)
 f.close()

 # load the data back in
 from numpy import frombuffer
 nrecords = 5      # number of float64 values per record
 bytesize = 8      # bytes per float64 value
 recordsize = nrecords * bytesize  # 40-byte fixed-length records
 data = sc.binaryRecords('test.bin', recordsize)
 # buffer() is Python 2; on Python 3 use frombuffer(v, 'float64') directly
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

 # these should be equal
 parsed.first()
 dat[0,:]


Does that help?

-
jeremyfreeman.net
@thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
 
 What are some efficient ways to read a large file into RDDs?
 
 For example, have several executors read a specific/unique portion of the 
 file and construct RDDs. Is this possible to do in Spark?
 
 Currently, I am doing a line-by-line read of the file at the driver and 
 constructing the RDD.



Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
The file has a specific structure. I outline it below.

The input file is basically a representation of a graph.

INT
INT(A)
LONG (B)
A INTs(Degrees)
A SHORTINTs  (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are
edge attributes)

After reading in the file, I need to create two RDDs (one with vertices and
the other with edges)

On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could be of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.