Hello,

I'm interested in getting started with Spark to scale our scientific analysis 
package (http://pynbody.github.io) to larger data sets. The package is written 
in Python and makes heavy use of numpy/scipy and related frameworks. I have a 
couple of questions that I haven't been able to find easy answers to despite 
some research, so I hope someone here can clarify things for me a bit!

* Is there a preferred way to read binary data off a local disk directly into 
an RDD? Our I/O routines are built to read data in chunks, and each chunk could 
be read by a different process/RDD, but it's not clear to me how to accomplish 
this with the existing API (I've pasted a rough sketch of what I mean below, 
after the second question). Since the idea is to process data sets that don't 
fit into a single node's memory, reading first and then distributing via 
sc.parallelize is obviously not an option. 

* Related to the first question -- when an RDD is created by parallelizing a 
numpy array, the array gets serialized and distributed. I see in the source 
that it actually gets written into a file first (!?) -- but surely the Py4J 
bottleneck for Python array types (mentioned in the source comment) doesn't 
really apply to numpy arrays? Is it really necessary to dump the data onto disk 
first? Conversely, collect() seems really slow, and I suspect that this is 
due to the combination of disk I/O and Python list creation. Are there any ways 
of getting around this if numpy arrays are being used? (A toy snippet of what 
I'm currently doing is below as well.) 


I'd be curious about any other best-practice tips anyone might have for 
running PySpark with numpy data...! 

Thanks!


Rok
