Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-31 Thread Andrew Ehrlich
You could write each image to a different directory instead of a different file. That can be done by filtering the RDD into one RDD for each image and then saving each. That might not be what you’re after though, in terms of space and speed efficiency. Another way would be to save them multiple
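A minimal sketch of the filter-and-save approach described above, assuming the RDD holds (path, content) pairs as returned by sc.binaryFiles. The helper names `output_dir_for` and `save_each_to_own_directory` and the `out_root` parameter are hypothetical, and the sketch saves each file's byte count rather than the image itself. It runs one Spark job per file, so it only suits a small number of files.

```python
import posixpath


def output_dir_for(source_path, out_root):
    """Map e.g. 'hdfs://nn/myimages/one.jpg' to '<out_root>/one' (hypothetical layout)."""
    stem, _ext = posixpath.splitext(posixpath.basename(source_path))
    return posixpath.join(out_root, stem)


def save_each_to_own_directory(pairs_rdd, out_root):
    """Save one output directory per (path, content) pair in pairs_rdd."""
    # Keep only the byte counts so the image bytes are not re-read per filter.
    sizes = pairs_rdd.mapValues(len).cache()
    for path in sizes.keys().collect():  # one Spark job per file
        # Bind `path` as a default argument so each filter sees its own key.
        one = sizes.filter(lambda kv, p=path: kv[0] == p).values()
        # saveAsTextFile creates a directory and fails if it already exists.
        one.saveAsTextFile(output_dir_for(path, out_root))
```

Each call to saveAsTextFile produces a directory of part files, not a single file, which is why this approach yields one directory per image.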

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I am just trying to do this as a proof of concept. The actual content of the files will be quite big. I'm having a problem using foreach or something similar on an RDD. sc.binaryFiles("/root/sift_images_test/*.jpg") returns ("filename1", bytes) ("filename2", bytes) I'm wondering if there is a do

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread ayan guha
This sounds like a bad idea, given that HDFS does not work well with small files. On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma wrote: > I am reading a bunch of files in PySpark using binaryFiles. Then I want to > get the number of bytes for each file and write this number to an HDFS

How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name. Example: if the directory /myimages has one.jpg, two.jpg, and three.jpg, then I want three files one-success.jpg,
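The first half of this task, pairing each input file with its output name and byte count, can be sketched as below. This is an illustration under assumptions, not a method confirmed in the thread: `size_record` and `byte_counts` are hypothetical helper names, and the "-success" naming is inferred from the single example given above. Writing each result to its own HDFS file is left out, since the thread does not show how that was done.

```python
import posixpath


def size_record(pair, suffix="-success"):
    """Turn a (path, content) pair into (output file name, byte count)."""
    path, content = pair
    stem, ext = posixpath.splitext(posixpath.basename(path))
    # Assumed naming: 'one.jpg' becomes 'one-success.jpg'.
    return stem + suffix + ext, len(content)


def byte_counts(sc, pattern):
    """Return [(output name, byte count), ...] on the driver.

    `sc` is an existing SparkContext; `pattern` is a glob such as
    '/myimages/*.jpg'. Only names and integers are collected, not image bytes.
    """
    return sc.binaryFiles(pattern).map(size_record).collect()
```

Because only small (name, count) tuples reach the driver, collect() stays cheap even when the images themselves are large.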