Hi,

I have a Python program for single-machine execution, and I have also
implemented the same program for Spark. The program basically just reads a
.json file, takes one field from it, and saves it back. Using Spark, my
program runs approximately 100 times slower on 1 master and 1 slave, so I
would like to ask where the problem might be.

My Spark program looks like:
 
import sys
from pyspark import SparkContext

sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2])                          # input path, one JSON per line
json_extractor = JsonExtractor(sys.argv[1])                  # field name to extract
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])                      # output path
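
The job is submitted roughly like this (the script name here is just a
placeholder):

spark-submit json_preprocessor.py <field name> <input path> <output path>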

JsonExtractor only selects the data from the field given by sys.argv[1].
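
Roughly, the class boils down to something like this (a minimal sketch; the
real implementation may differ, but the idea is a plain field lookup per
line):

import json

class JsonExtractor(object):
    # Minimal sketch: parse one JSON line and yield the value of one field.
    def __init__(self, field):
        self.field = field

    def extract_json(self, line):
        # flatMap expects an iterable; an empty list drops the line.
        try:
            record = json.loads(line)
        except ValueError:
            return []
        if self.field in record:
            return [str(record[self.field])]
        return []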
 
My data are basically many small JSON files, with one JSON object per line.
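
For illustration, a hypothetical input file could look like the two lines
below; with sys.argv[1] = "text", the job would emit the two text values:

{"id": 1, "text": "first record"}
{"id": 2, "text": "second record"}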

I have tried reading and writing the data both from/to Amazon S3 and from/to
local disk on all the machines.

I would like to ask whether there is something I am missing, or whether Spark
is expected to be this slow compared with a local, non-parallelized,
single-node program.
 
Thank you in advance for any suggestions or hints.
