It depends on how you loaded the data. Ideally, if you have only a few dozen 
records, your input data should hold them in a single partition. If the input 
has one partition and the data is small enough, Spark will keep it in one 
partition (as far as possible).

If you cannot control your input data, you need to repartition the data when 
you load it. This will (eventually) cause a shuffle, and all the data will be 
moved into the number of partitions that you specify. Subsequent operations 
will run on the repartitioned dataframe and should take a number of tasks that 
matches the partition count. A shuffle has costs associated with it. You will 
need to make a call: either pay the upfront cost of a shuffle, or live with a 
large number of tasks.
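
As a rough illustration of the trade-off above, here is a minimal PySpark-style 
sketch. The helper name with_partition_count and its parameters are my own 
(hypothetical), and the coalesce() branch is an alternative I am adding, not 
something discussed earlier in this thread. It relies only on 
DataFrame.rdd.getNumPartitions(), DataFrame.repartition() and 
DataFrame.coalesce(), so treat it as a starting point rather than a tuned 
solution.

```python
def with_partition_count(df, target_partitions=1, allow_shuffle=True):
    """Return `df` arranged into `target_partitions` partitions.

    `df` is expected to be a Spark DataFrame. Only three real DataFrame
    members are used: `rdd.getNumPartitions()`, `repartition(n)` and
    `coalesce(n)`. The helper name and its parameters are hypothetical.
    """
    current = df.rdd.getNumPartitions()
    if current == target_partitions:
        # Already the desired layout: avoid any extra work.
        return df
    if allow_shuffle:
        # Full shuffle: rows are redistributed into exactly
        # `target_partitions` partitions.
        return df.repartition(target_partitions)
    # coalesce() merges existing partitions without a full shuffle.
    # It can only reduce the partition count, never increase it.
    return df.coalesce(target_partitions)


# Hypothetical usage with an existing SparkSession named `spark`
# and a placeholder table name:
#   small = with_partition_count(spark.table("table_name"), 1)
#   small.rdd.getNumPartitions()
```

Checking getNumPartitions() before and after loading a small table is also a 
quick way to confirm whether the task count is what is driving the runtime.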

From: Tin Vu <tvu...@ucr.edu>
Date: Thursday, March 29, 2018 at 10:47 AM
To: "Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low 
when compared to Drill or Presto

You are right. Too many tasks were created. How can we reduce the number of 
tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh 
<jayesh.lalw...@capitalone.com> wrote:
Without knowing too many details, I can only guess. It could be that Spark is 
creating a lot of tasks even though there are few records. Creating and 
distributing tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.

From: Tin Vu <tvu...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when 
compared to Drill or Presto

Hi,

I am running a benchmark to compare the performance of SparkSQL, Apache Drill, 
and Presto. My experimental setup:
• TPCDS dataset with scale factor 100 (size 100GB).
• Spark, Drill, and Presto each have the same number of workers: 12.
• Each worker has the same amount of allocated memory: 4GB.
• Data is stored in Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even a table with a few dozen 
records), SparkSQL still required about 7-8 seconds to finish, while Drill and 
Presto needed less than 1 second.
For large tables with billions of records, SparkSQL performance was 
reasonable: it required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,


________________________________

The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.
