This sounds like a bad idea, given that HDFS does not work well with small files.
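If the goal is just the byte count per image, a small-files-friendly alternative is to save a single output of (filename, length) records rather than one file per image. A minimal sketch, reusing the paths from the quoted code below; the tab-separated layout and the /tmp/lengths output path are illustrative choices, not something from this thread:

from pyspark import SparkContext

sc = SparkContext("local", "test")

# binaryFiles yields (path, contents) pairs; len(contents) is already the
# byte count, so no numpy round-trip is needed.
images = sc.binaryFiles("/root/sift_images_test/*.jpg")
images.map(lambda fc: "%s\t%d" % (fc[0], len(fc[1]))) \
      .saveAsTextFile("hdfs://localhost:9000/tmp/lengths")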
On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaara...@gmail.com> wrote:

> I am reading a bunch of files in PySpark using binaryFiles. Then I want to
> get the number of bytes for each file and write this number to an HDFS
> file with the corresponding name.
>
> Example:
>
> If directory /myimages has one.jpg, two.jpg, and three.jpg, then I want
> three files one-success.txt, two-success.txt, and three-success.txt in
> HDFS, each containing a number. The number will specify the length in
> bytes.
>
> Here is what I've done thus far:
>
> from pyspark import SparkContext
> import numpy as np
>
> sc = SparkContext("local", "test")
>
> def bytes_length(rawdata):
>     length = len(np.asarray(bytearray(rawdata), dtype=np.uint8))
>     return length
>
> images = sc.binaryFiles("/root/sift_images_test/*.jpg")
> images.map(lambda fc: bytes_length(fc[1])) \
>     .saveAsTextFile("hdfs://localhost:9000/tmp/somefile")
>
> However, doing this creates a single output in HDFS:
>
> $ hadoop fs -cat /tmp/somefile/part-00000
> 113212
> 144926
> 178923
>
> Instead I want /tmp/somefile in HDFS to have three files:
>
> one-success.txt with value 113212
> two-success.txt with value 144926
> three-success.txt with value 178923
>
> Is it possible to achieve what I'm after? I don't want to write files to
> the local file system and then put them in HDFS. Instead, I want to use
> the saveAsTextFile method on the RDD directly.

--
Best Regards,
Ayan Guha
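That said, if per-file outputs are genuinely required, saveAsTextFile cannot name its output files per record, so one workable pattern is to collect the counts and write each file through the Hadoop FileSystem API instead. A rough sketch under two assumptions: the counts fit in driver memory, and the private PySpark handles into the JVM (sc._jvm and sc._jsc) are acceptable to use:

import os
from pyspark import SparkContext

sc = SparkContext("local", "test")

# Collect (path, byte_count) pairs to the driver; fine for a handful of images.
counts = sc.binaryFiles("/root/sift_images_test/*.jpg") \
           .map(lambda fc: (fc[0], len(fc[1]))) \
           .collect()

# Write one small HDFS file per input image via the Hadoop FileSystem API.
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
for path, length in counts:
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. "one"
    out = fs.create(jvm.org.apache.hadoop.fs.Path(
        "/tmp/somefile/%s-success.txt" % name))
    out.writeBytes(str(length))  # DataOutputStream.writeBytes(String)
    out.close()

Note this still creates exactly the many small files the caveat above warns about, so it only makes sense at small scale.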