On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
We were comparing the performance of some of our production Hive queries
between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains in Spark have
been good.

Is there any particular reason you are using an ancient & slow Hadoop 1.x version instead of a modern Hadoop 2.x cluster with YARN?

We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
performance was 2x better than Hive (120 sec vs 60 sec). Table T is
stored in S3 as a single 600 MB GZIP file.

Not sure if you realize it, but what you're doing is one of the worst cases for both platforms.

Using a single big GZIP file is a massive anti-pattern: GZIP is not splittable, so the whole file ends up being scanned by one task.

I'm assuming what you want is fast SQL in Hive (since this is the Hive list), along with all the other windowing functions like lead/lag.

You need a SQL-oriented columnar format like ORC; mix in YARN and add Tez, and that query is going to land somewhere near 10-12 seconds.

Oh, and that's a ball-park figure for a single node.
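If it helps, a rough sketch of the conversion (T and col3 are from your example; t_orc is just an illustrative name, and exact settings depend on your cluster):

  -- run on a Hadoop 2.x cluster with Tez installed
  SET hive.execution.engine=tez;

  -- rewrite the single-GZIP table into a splittable, columnar copy
  CREATE TABLE t_orc STORED AS ORC AS SELECT * FROM T;

  -- same query as before, now against ORC
  SELECT count(*) FROM t_orc WHERE col3 = 123;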

Cheers,
Gopal
