Hi,

We have been trying to use Spark SQL to process some gzipped JSON-format log
files stored on S3 or HDFS, but the performance is very poor.

For example, here is the code I run over 20 gzipped files (their total size
is 4GB compressed and ~40GB decompressed):

gzfile = 's3n://my-logs-bucket/*.gz'  # or 'hdfs://nameservice1/tmp/*.gz'
sc_file = sc.textFile(gzfile)         # read the gzipped files as an RDD of lines
sc_file.cache()                       # cache the raw lines
df = sqlContext.jsonRDD(sc_file)      # infer a schema and build a DataFrame
df.select('*').limit(1).show()        # show the first row

With 6 executors launched, each with 2 CPU cores and 5GB of RAM, the
"df.select" operation always takes more than 150 seconds to finish,
regardless of whether the files are stored on S3 or HDFS.

BTW, we are running Spark 1.6.1 on Mesos in fine-grained mode.

Downloading from S3 is fast. In another test in the same environment, a
simple "sc_file.count()" over 500 similar files (15GB compressed and 400GB
decompressed in total) takes no more than 2 minutes.

I thought the bottleneck might be the JSON schema auto-inference. However, I
have tried specifying the schema explicitly instead of letting Spark infer
it, and that makes no notable difference.
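
For reference, this is roughly what I did to pass the schema (the field
names below are just placeholders for our real log fields):

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder fields -- our real schema lists the actual log columns.
schema = StructType([
    StructField('timestamp', StringType(), True),
    StructField('level', StringType(), True),
    StructField('message', StringType(), True),
])

# Passing the schema to jsonRDD skips the sampling pass used for inference.
df = sqlContext.jsonRDD(sc_file, schema)
df.select('*').limit(1).show()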

Things I plan to try soon:

* Decompress the gz files, save them to HDFS, construct a DataFrame on the
decompressed files, and then run SQL over it.
* Or save the JSON data into Parquet format on HDFS and then run SQL over
it (a rough sketch of this follows below).
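
For the second option, something along these lines (the output path below
is just a placeholder):

# Write the DataFrame out as Parquet on HDFS once...
df.write.parquet('hdfs://nameservice1/tmp/logs_parquet')

# ...then build a new DataFrame on the Parquet files and query that instead.
pq_df = sqlContext.read.parquet('hdfs://nameservice1/tmp/logs_parquet')
pq_df.registerTempTable('logs')
sqlContext.sql('SELECT * FROM logs LIMIT 1').show()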

Do you have any suggestions? Thanks!

Regards,
Shuai
