[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Hi, I am executing a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup: - TPCDS dataset with scale factor 100 (size 100GB). - Spark, Drill and Presto have the same number of workers: 12. - Each worker has the same amount of allocated memory: 4GB. - Da

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
I don’t think select * is a good benchmark. You should do a more complex operation, otherwise optimizers might see that you don’t do anything in the query and immediately return (similarly, count might immediately return by using some statistics). > On 29. Mar 2018, at 02:03, Tin Vu wrote: > >
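
The point above about trivial queries can be made concrete with a small timing harness. The sketch below is only illustrative: `run_query` is a hypothetical callable that submits SQL to whichever engine is under test (Spark SQL, Drill or Presto) and blocks until every row has been consumed, and the table and column names in `QUERIES` are placeholders, not names taken from this thread. It times a plain `SELECT *` next to a query that forces real work, using a warm-up run and the median of several repeats.

```python
import statistics
import time
from typing import Callable, Dict

# Placeholder queries: `small_table` and `some_column` are hypothetical names.
QUERIES: Dict[str, str] = {
    # Trivial scan: an engine may answer this with very little work.
    "select_star": "SELECT * FROM small_table",
    # Aggregation: forces the engine to read and group the data.
    "group_by": (
        "SELECT some_column, COUNT(*) AS n "
        "FROM small_table GROUP BY some_column ORDER BY n DESC"
    ),
}


def time_query(
    run_query: Callable[[str], object],
    sql: str,
    warmups: int = 1,
    repeats: int = 3,
) -> float:
    """Return the median wall-clock time in seconds of `repeats` runs."""
    # Warm-up runs are executed but not measured (caches, JIT, connections).
    for _ in range(warmups):
        run_query(sql)

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)  # assumed to block until the full result is consumed
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def benchmark(
    run_query: Callable[[str], object],
    queries: Dict[str, str],
    warmups: int = 1,
    repeats: int = 3,
) -> Dict[str, float]:
    """Time every query in `queries` and return {name: median_seconds}."""
    return {
        name: time_query(run_query, sql, warmups=warmups, repeats=repeats)
        for name, sql in queries.items()
    }
```

To compare engines, one would pass a different `run_query` per engine (for example a wrapper around each engine's CLI or client library) and compare the resulting dictionaries. If the `select_star` timings differ widely while the `group_by` timings do not, start-up or result-fetching overhead is a more likely explanation than query execution itself.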

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Thanks for your response. What did you mean when you said "immediately return"? On Wed, Mar 28, 2018, 10:33 PM Jörn Franke wrote: > I don’t think select * is a good benchmark. You should do a more complex > operation, otherwise optimizers might see that you don’t do anything in the > query and im

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
UI. From: Tin Vu Date: Wednesday, March 28, 2018 at 8:04 PM To: "user@spark.apache.org" Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto Hi, I am executing a benchmark to compare performance of SparkSQL, Apache Drill and

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
"user@spark.apache.org" > Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very > low when compared to Drill or Presto > > > > Hi, > > > > I am executing a benchmark to compare performance of SparkSQL, Apache > Drill and Presto. My experimental setup

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
Date: Wednesday, March 28, 2018 at 8:04 PM To: "user@spark.apache.org" Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto Hi, I am executing a benc

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Gourav Sengupta
Hi Tin, This sounds interesting. While I would prefer to think that Presto and Drill have can you please provide the following details: 1. SPARK version 2. The exact code used in SPARK (the full code that was used) 3. HADOOP version I do think that SPARK and DRILL have complementary and differen

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
Hi Gourav, Thank you for your response. Here are the answers to your questions: 1. Spark 2.3.0 2. I was using the 'spark-sql' command, for example: 'spark-sql --master spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where file_name is the file that contains the SQL script ("select * fro