How many files do you have and how big is each JSON object?

Spark works better with a few big files vs many smaller ones. So you could try 
cat'ing your files together and rerunning the same experiment. 
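For example, a minimal sketch of that concatenation step in Python (the 
parts/ directory and the combined.json name here are just illustrative):

import glob

# Append every small JSON-lines file to one big file so Spark opens a
# single large input instead of thousands of tiny ones.
with open("combined.json", "w") as out:
    for path in sorted(glob.glob("parts/*.json")):
        with open(path) as src:
            data = src.read()
            if not data:
                continue
            out.write(data if data.endswith("\n") else data + "\n")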

- Evan


> On Oct 18, 2014, at 12:07 PM, jan.zi...@centrum.cz wrote:
> 
> Hi,
> 
> I have a program that I originally wrote for single-computer execution 
> (in Python) and then implemented the same thing in Spark. The program 
> basically only reads .json input, takes one field from it, and saves it 
> back. Using Spark, my program runs approximately 100 times slower on 
> 1 master and 1 slave, so I would like to ask where the problem might be.
> 
> My Spark program looks like:
>  
> import sys
> from pyspark import SparkContext
> 
> sc = SparkContext(appName="Json data preprocessor")
> # sys.argv[1] names the field to extract, sys.argv[2] the input path,
> # sys.argv[3] the output path.
> distData = sc.textFile(sys.argv[2])
> json_extractor = JsonExtractor(sys.argv[1])
> cleanedData = distData.flatMap(json_extractor.extract_json)
> cleanedData.saveAsTextFile(sys.argv[3])
> 
> JsonExtractor only selects the data from the field given by sys.argv[1].
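> 
> (For reference, a simplified sketch of that behavior, assuming one 
> JSON object per line; not the exact class:)
> 
> import json
> 
> class JsonExtractor(object):
>     def __init__(self, field):
>         self.field = field
> 
>     def extract_json(self, line):
>         # Return the field's value when present; an empty list makes
>         # flatMap drop lines that lack the field or fail to parse.
>         try:
>             obj = json.loads(line)
>         except ValueError:
>             return []
>         return [obj[self.field]] if self.field in obj else []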
>  
> My data are basically many small JSON files, with one JSON object per line.
> 
> I have tried reading and writing the data both from/to Amazon S3 and 
> from/to local disk on all the machines.
> 
> I would like to ask whether there is something I am missing, or whether 
> Spark is supposed to be this slow in comparison with a local, 
> non-parallelized single-node program.
>  
> Thank you in advance for any suggestions or hints.
> 
