Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Gourav Sengupta Sat, 31 Mar 2018 11:41:50 -0700

Hi Tin,

This sounds interesting. While I would prefer to think that Presto and
Drill have


Can you please provide the following details:
1. SPARK version
2. The exact code used in SPARK (the full code that was used)
3. HADOOP version
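
For item 2, it would help if the timings were collected the same way for every engine. Below is a minimal sketch of a timing harness, not the code used in the benchmark. The helper name `time_query`, its `repeats` parameter, and the keys of the returned dictionary are my own assumptions for illustration. Reporting the first run separately from the median makes it easier to see whether the 7-8 seconds is a one-off startup cost or is paid on every query.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Call `run_query` `repeats` times and summarise wall-clock latency in seconds.

    `run_query` is any zero-argument callable that executes the query and
    forces the result to be materialised (hypothetical helper, not a Spark API).
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)

    return {
        "first_run_s": timings[0],               # includes any cold-start cost
        "median_s": statistics.median(timings),  # typical latency over all runs
        "min_s": min(timings),
        "max_s": max(timings),
    }


# Hypothetical usage, assuming pyspark is installed and `spark` is an existing
# SparkSession; `table_name` is the placeholder from the original query:
#   stats = time_query(lambda: spark.sql("SELECT * FROM table_name").collect())
#   print(spark.version, stats)
```

The same helper can wrap a Drill or Presto client call, so that all three engines are measured from query submission to full result retrieval.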

I do think that SPARK and DRILL have complementary and different use
cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?

Regards,
Gourav Sengupta


On Thu, Mar 29, 2018 at 1:03 AM, Tin Vu <tvu...@ucr.edu> wrote:

> Hi,
>
> I am executing a benchmark to compare performance of SparkSQL, Apache
> Drill and Presto. My experimental setup:
>
>    - TPCDS dataset with scale factor 100 (size 100GB).
>    - Spark, Drill, Presto have the same number of workers: 12.
>    - Each worker has the same amount of allocated memory: 4GB.
>    - Data is stored by Hive with ORC format.
>
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen
> records), SparkSQL still required about 7-8 seconds to finish, while Drill
> and Presto needed less than 1 second.
> For other large tables with billions of records, SparkSQL performance was
> reasonable: it required 20-30 seconds to scan the whole table.
> Do you have any idea or reasonable explanation for this issue?
>
> Thanks,
>
>
