Hi Gaurav,

Thank you for your response. Here are the answers to your questions:
1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example: 'spark-sql --master
spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where
$file_name is a file containing the SQL script ("select * from table_name").
3. Hadoop 2.9.0
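
For clarity, the full invocation spelled out would look roughly like this (the
master URL is masked as in the command above, so <master-host> is a
placeholder, and the inline query mirrors the script in the file):

```shell
# Run a single query against the TPC-DS ORC database from the spark-sql CLI.
# <master-host> stands in for the masked master address; -e runs the query
# inline instead of reading it from a file with -f.
spark-sql --master spark://<master-host>:7077 \
  --database tpcds_bin_partitioned_orc_100 \
  -e "SELECT * FROM table_name"
```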

I am using a JDBC connector to Drill from the Hive Metastore. SparkSQL also
connects to the ORC database through Hive.

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> wrote:

> Hi Tin,
>
> This sounds interesting. While I would prefer to think that Presto and
> Drill have
>
> Can you please provide the following details:
> 1. SPARK version
> 2. The exact code used in SPARK (the full code that was used)
> 3. HADOOP version
>
> I do think that SPARK and DRILL have complementary and different use
> cases. Have you tried using a JDBC connector to Drill from within SPARKSQL?
>
> Regards,
> Gourav Sengupta
>
>
> On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:
>
>> Hi,
>>
>> I am executing a benchmark to compare performance of SparkSQL, Apache
>> Drill and Presto. My experimental setup:
>>
>>    - TPCDS dataset with scale factor 100 (size 100GB).
>>    - Spark, Drill, and Presto have the same number of workers: 12.
>>    - Each worker has the same allocated amount of memory: 4GB.
>>    - Data is stored in Hive in ORC format.
>>
>> I executed a very simple SQL query: "SELECT * from table_name"
>> The issue is that for some small tables (even tables with a few dozen
>> records), SparkSQL still required about 7-8 seconds to finish, while
>> Drill and Presto needed less than 1 second.
>> For large tables with billions of records, SparkSQL's performance was
>> reasonable: it took 20-30 seconds to scan the whole table.
>> Do you have any idea or reasonable explanation for this issue?
>>
>> Thanks,
>>
>>
>
