Hi all,

     I'm playing around with manipulating images in Python and want to
use Spark for scalability. That said, I'm just learning Spark and my
Python is a bit rusty (I've been doing PHP for the last few years). I
think I have most of the process figured out. However, the script fails
outright on larger images, and Spark emits the following warning for
smaller ones:

Stage 0 contains a task of very large size (1151 KB). The maximum
recommended task size is 100 KB.

My code is as follows:

import Image
from pyspark import SparkContext

if __name__ == "__main__":

    imageFile = "sample.jpg"
    outFile   = "sample.gray.jpg"

    sc = SparkContext(appName="Grayscale")
    im = Image.open(imageFile)

    # Create an RDD for the data from the image file
    img_data = sc.parallelize( list(im.getdata()) )

    # Create an RDD for the grayscale value
    gValue = img_data.map( lambda x: int(x[0]*0.21 + x[1]*0.72 + x[2]*0.07) )

    # Put our grayscale value into the RGB channels
    grayscale = gValue.map( lambda x: (x,x,x)  )

    # Save the output in a new image.
    im.putdata( grayscale.collect() )

    im.save(outFile)
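
In case it helps, this is roughly the conversion I'm trying to reproduce, as a
quick plain-PIL sketch without Spark (assuming an RGB image; the filenames are
just placeholders):

import Image

im = Image.open("sample.jpg")
# Weighted luminosity: 0.21*R + 0.72*G + 0.07*B, copied into all three channels
gray = [ (int(r*0.21 + g*0.72 + b*0.07),)*3 for (r, g, b) in im.getdata() ]
im.putdata(gray)
im.save("sample.gray.jpg")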

Obviously, something is amiss, but I can't figure out where I'm off track
with this. Any help is appreciated. Thanks in advance!
