How many files do you have, and how big is each JSON object? Spark works better with a few big files than with many smaller ones, so you could try cat'ing your files together and rerunning the same experiment.
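A minimal sketch of that concatenation step in plain Python (the part-file names here are hypothetical stand-ins for your real inputs):

```python
import glob
import os
import tempfile

# Create a few small example "one JSON object per line" files,
# standing in for the many small input files described below.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, "part-%d.json" % i), "w") as f:
        f.write('{"id": %d, "field": "value%d"}\n' % (i, i))

# Concatenate the many small files into one larger file -- the
# equivalent of `cat part-*.json > combined.json` on the shell.
combined = os.path.join(tmpdir, "combined.json")
with open(combined, "w") as out:
    for path in sorted(glob.glob(os.path.join(tmpdir, "part-*.json"))):
        with open(path) as src:
            out.write(src.read())

# The combined file still has one JSON object per line, so the same
# sc.textFile(...) pipeline can read it unchanged, just from far
# fewer (and larger) input files.
with open(combined) as f:
    lines = f.readlines()
print(len(lines))
```

Since the data is already line-delimited JSON, concatenation does not change what `sc.textFile` sees per line; it only reduces the per-file overhead of opening many tiny inputs.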
- Evan

> On Oct 18, 2014, at 12:07 PM, <jan.zi...@centrum.cz> <jan.zi...@centrum.cz> wrote:
>
> Hi,
>
> I have a program for single-computer execution (in Python), and I have also
> implemented the same program for Spark. The program basically only reads a
> .json file, takes one field from it, and saves it back. Using Spark, my
> program runs approximately 100 times slower on 1 master and 1 slave, so I
> would like to ask where the problem might be.
>
> My Spark program looks like:
>
> sc = SparkContext(appName="Json data preprocessor")
> distData = sc.textFile(sys.argv[2])
> json_extractor = JsonExtractor(sys.argv[1])
> cleanedData = distData.flatMap(json_extractor.extract_json)
> cleanedData.saveAsTextFile(sys.argv[3])
>
> JsonExtractor only selects the data from the field given by sys.argv[1].
>
> My data is basically many small JSON files, with one JSON object per line.
>
> I have tried both reading and writing the data from/to Amazon S3 and from/to
> the local disk on all the machines.
>
> I would like to ask if there is something I am missing, or if Spark is
> supposed to be this slow in comparison with a local, non-parallelized,
> single-node program.
>
> Thank you in advance for any suggestions or hints.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org