I am reading a bunch of files in PySpark using `binaryFiles`. Then I want to get the number of bytes in each file and write that number to an HDFS file with a corresponding name.
Example: if the directory `/myimages` contains `one.jpg`, `two.jpg`, and `three.jpg`, then I want three files `one-success.txt`, `two-success.txt`, and `three-success.txt` in HDFS, each containing a single number: the byte length of the corresponding image. Here is what I've done so far:

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext("local", "test")

def bytes_length(rawdata):
    # convert the raw bytes to a uint8 array and take its length
    length = len(np.asarray(bytearray(rawdata), dtype=np.uint8))
    return length

images = sc.binaryFiles("/root/sift_images_test/*.jpg")
images.map(lambda pair: bytes_length(pair[1])) \
      .saveAsTextFile("hdfs://localhost:9000/tmp/somefile")
```

However, this creates a single file in HDFS:

```
$ hadoop fs -cat /tmp/somefile/part-00000
113212
144926
178923
```

Instead, I want `/tmp/somefile` in HDFS to contain three files:

- `one-success.txt` with value `113212`
- `two-success.txt` with value `144926`
- `three-success.txt` with value `178923`

Is it possible to achieve what I'm after? I don't want to write the files to the local file system and then put them into HDFS. Instead, I want to use the `saveAsTextFile` method on the RDD directly.
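For reference, the per-record transformation itself is straightforward; the open question is only how to make Spark write each record to its own file. A minimal sketch of the helpers I have in mind (pure Python, no Spark needed; the function names `success_path` and `bytes_length` are my own, and plain `len()` on the raw bytes should suffice without the NumPy round-trip):

```python
import os

def success_path(out_dir, path):
    # "/root/sift_images_test/one.jpg" -> "<out_dir>/one-success.txt"
    base = os.path.splitext(os.path.basename(path))[0]
    return os.path.join(out_dir, base + "-success.txt")

def bytes_length(contents):
    # binaryFiles yields (filename, bytes) pairs; len() on the
    # bytes object is already the file size in bytes
    return len(contents)
```

Applied to a `(filename, contents)` pair from `binaryFiles`, this yields the target file name and the number to write into it.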