You could write each image to a different directory instead of a different
file. That can be done by filtering the RDD into one RDD per image and
then saving each, though that might not be what you're after in terms of
space and speed efficiency. Another way would be to save them multiple
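The filter-per-image idea above can be sketched without a cluster. This is a local, pure-Python mock in which plain lists stand in for RDDs; the names `records` and `split_by_key` are illustrative, not PySpark API:

```python
# Local sketch of the "one RDD per image" idea. Plain lists stand in
# for RDDs so the logic runs without Spark; in real PySpark you would
# call rdd.filter(lambda kv: kv[0] == name) once per filename and then
# save each filtered RDD separately.

records = [("one.jpg", b"\xff\xd8"), ("two.jpg", b"\xff\xd8\xff")]

def split_by_key(pairs):
    """Partition (filename, bytes) pairs into one list per filename."""
    out = {}
    for name, data in pairs:
        out.setdefault(name, []).append((name, data))
    return out

per_image = split_by_key(records)
# Each per_image[name] would then be saved to its own directory.
```

The cost is one full pass over the data per distinct filename, which is why it may not be space- or speed-efficient for many images.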
I am just trying to do this as a proof of concept. The actual content of
the files will be quite big.
I'm having a problem using foreach or something similar on an RDD.
sc.binaryFiles("/root/sift_images_test/*.jpg")
returns
("filename1", bytes)
("filename2", bytes)
I'm wondering if there is a do
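If the goal is the byte count per file, a map is enough before any foreach. Below is a local stand-in using a plain list in place of the RDD (the list `rdd_like` is illustrative):

```python
# Local stand-in for the RDD returned by sc.binaryFiles: a plain
# list of (filename, bytes) pairs. The PySpark equivalent of the
# comprehension below would be rdd.map(lambda kv: (kv[0], len(kv[1]))).
rdd_like = [("filename1", b"\x00" * 1024), ("filename2", b"\x00" * 2048)]
sizes = [(name, len(data)) for name, data in rdd_like]
# sizes is [("filename1", 1024), ("filename2", 2048)]
```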
This sounds like a bad idea, given that HDFS does not work well with small files.
On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma wrote:
> I am reading bunch of files in PySpark using binaryFiles. Then I want to
> get the number of bytes for each file and write this number to an HDFS file
> with the corresponding name.
I am reading a bunch of files in PySpark using binaryFiles. Then I want to
get the number of bytes for each file and write this number to an HDFS file
with the corresponding name.
Example:
if directory /myimages has one.jpg, two.jpg, and three.jpg then I want
three files: one-success.jpg, two-success.jpg, and three-success.jpg
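The end-to-end goal can be sketched locally like this. A temporary directory stands in for HDFS, and `write_success_files` is a hypothetical helper; in real PySpark the writes would happen inside a foreach action using an HDFS client rather than the local filesystem:

```python
import os
import tempfile

def write_success_files(pairs, out_dir):
    """For each (filename, data) pair, write len(data) to a file named
    <base>-success<ext> in out_dir. The local filesystem stands in for
    HDFS in this sketch."""
    written = []
    for name, data in pairs:
        base, ext = os.path.splitext(name)
        path = os.path.join(out_dir, f"{base}-success{ext}")
        with open(path, "w") as f:
            f.write(str(len(data)))
        written.append(path)
    return written

pairs = [("one.jpg", b"a" * 10), ("two.jpg", b"b" * 20), ("three.jpg", b"c" * 30)]
out_dir = tempfile.mkdtemp()
paths = write_success_files(pairs, out_dir)
```

Note the caveat raised in the thread still applies: producing one tiny file per image is exactly the small-files pattern HDFS handles poorly.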