Hi Gaurav,

Thank you for your response. Here are the answers to your questions:

1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example: 'spark-sql --master spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where $file_name is the file that contains the SQL script ("select * from table_name").
3. Hadoop 2.9.0
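As an aside, a minimal sketch of how that per-query invocation could be assembled when looping over TPC-DS query files (the flags mirror the command above; the host name and query file name are hypothetical stand-ins for the redacted values):

```python
import shlex

def build_spark_sql_cmd(master, database, sql_file):
    """Build the spark-sql command line used for one TPC-DS query file."""
    return [
        "spark-sql",
        "--master", master,
        "--database", database,
        "-f", sql_file,
    ]

# Same shape as the command in the mail above; the master URL is
# hypothetical here (it was redacted in the original message).
cmd = build_spark_sql_cmd(
    "spark://host:7077",
    "tpcds_bin_partitioned_orc_100",
    "q1.sql",
)
print(shlex.join(cmd))
# spark-sql --master spark://host:7077 --database tpcds_bin_partitioned_orc_100 -f q1.sql
```

Such a command list could then be handed to subprocess.run() and wrapped with a timer to collect the per-query latencies being compared.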
I am using the JDBC connector to Drill from the Hive Metastore. SparkSQL is also connecting to the ORC database through Hive.

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Tin,
>
> This sounds interesting. While I would prefer to think that Presto and
> Drill have
>
> Can you please provide the following details:
> 1. Spark version
> 2. The exact code used in Spark (the full code that was used)
> 3. Hadoop version
>
> I do think that Spark and Drill have complementary and different use
> cases. Have you tried using the JDBC connector to Drill from within SparkSQL?
>
> Regards,
> Gourav Sengupta
>
>
> On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:
>
>> Hi,
>>
>> I am executing a benchmark to compare the performance of SparkSQL, Apache
>> Drill, and Presto. My experimental setup:
>>
>> - TPC-DS dataset with scale factor 100 (size 100 GB).
>> - Spark, Drill, and Presto have the same number of workers: 12.
>> - Each worker has the same allocated amount of memory: 4 GB.
>> - Data is stored by Hive in ORC format.
>>
>> I executed a very simple SQL query: "SELECT * FROM table_name".
>> The issue is that for some small tables (even tables with a few dozen
>> records), SparkSQL still required about 7-8 seconds to finish, while
>> Drill and Presto needed less than 1 second.
>> For other large tables with billions of records, SparkSQL performance was
>> reasonable: it required 20-30 seconds to scan the whole table.
>> Do you have any idea or a reasonable explanation for this issue?
>>
>> Thanks,
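For reference, Gourav's suggestion of reading Drill from within SparkSQL over JDBC could look roughly like the sketch below. This is an assumption-laden illustration, not a tested setup: the driver class and `jdbc:drill:zk=...` URL format follow Drill's documented JDBC conventions as I understand them, the ZooKeeper host and table name are hypothetical, and the helper is only defined here (it would need a live SparkSession and a reachable Drillbit to actually run).

```python
def read_from_drill(spark, table):
    """Hypothetical helper: load `table` from Drill into a Spark DataFrame
    via Spark's generic JDBC data source. Host/port are placeholders."""
    return (
        spark.read.format("jdbc")
        # Drill JDBC URL, assumed to use a ZooKeeper quorum (hypothetical host)
        .option("url", "jdbc:drill:zk=zk-host:2181")
        # Drill's JDBC driver class (the Drill JDBC driver jar must be on
        # the Spark driver/executor classpath)
        .option("driver", "org.apache.drill.jdbc.Driver")
        .option("dbtable", table)
        .load()
    )
```

With a real SparkSession in hand, this would be called as `df = read_from_drill(spark, "hive.tpcds_bin_partitioned_orc_100.store_sales")` (table path hypothetical), after which `df` behaves like any other Spark DataFrame.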