Hi all, I'm experimenting with image manipulation in Python and want to use Spark for scalability. That said, I'm just learning Spark and my Python is a bit rusty (I've been doing PHP for the last few years). I think I have most of the process figured out. However, the script fails outright on larger images, and even for smaller images Spark emits the following warning:
    Stage 0 contains a task of very large size (1151 KB). The maximum recommended task size is 100 KB.

My code is as follows:

    from PIL import Image
    from pyspark import SparkContext

    if __name__ == "__main__":
        imageFile = "sample.jpg"
        outFile = "sample.gray.jpg"

        sc = SparkContext(appName="Grayscale")

        im = Image.open(imageFile)

        # Create an RDD from the pixel data of the image file
        img_data = sc.parallelize(list(im.getdata()))

        # Create an RDD of the grayscale value for each pixel (luminosity method)
        gValue = img_data.map(lambda x: int(x[0]*0.21 + x[1]*0.72 + x[2]*0.07))

        # Put our grayscale value into all three RGB channels
        grayscale = gValue.map(lambda x: (x, x, x))

        # Save the output in a new image
        im.putdata(grayscale.collect())
        im.save(outFile)

Obviously, something is amiss, but I can't figure out where I'm off track. Any help is appreciated. Thanks in advance!
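For reference, here's a plain-PIL sketch of the same conversion without Spark, just to make clear what result I'm after (this assumes a standard RGB JPEG, and the `sample.gray.serial.jpg` output name is only for illustration):

    from PIL import Image

    im = Image.open("sample.jpg")

    # Same luminosity formula as above, applied serially per (R, G, B) pixel
    gray = [int(r * 0.21 + g * 0.72 + b * 0.07) for (r, g, b) in im.getdata()]

    # Write the single gray value into all three channels of a new image
    out = Image.new("RGB", im.size)
    out.putdata([(v, v, v) for v in gray])
    out.save("sample.gray.serial.jpg")

This serial version produces the grayscale output I expect, so I believe the math is right and the problem is in how I'm handing the data to Spark.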