If you need to reduce the number of partitions, you could also try df.coalesce.
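Something along these lines (an untested sketch; the path and partition count are just placeholders):

val df = spark.read.format("avro").load("/path/to/avro")  // placeholder path
// coalesce only merges existing partitions, so it avoids a full shuffle
val reduced = df.coalesce(5)
println(reduced.rdd.getNumPartitions)  // 5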

---- On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote ----

Have you tried something like this?

spark.conf.set("spark.sql.shuffle.partitions", "5" ) 
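Note that this setting controls the number of partitions used after a shuffle (groupBy, join, etc.). A rough, untested sketch with a placeholder path and column:

spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark.read.format("avro").load("/path/to/avro")  // placeholder path
val counts = df.groupBy("some_column").count()            // hypothetical column
println(counts.rdd.getNumPartitions)                      // 5 after the shuffle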



On Wed, Apr 3, 2019 at 8:37 PM Arthur Li <arthur...@flipp.com> wrote:
Hi Sparkers,

I noticed that in my Spark application, the number of tasks in the first stage 
is equal to the number of files read by the application (at least for Avro) when 
the number of CPU cores is less than the number of files. If there are more CPU 
cores than files, it is usually equal to the default parallelism. Why does it 
behave like this? Does this require a lot of resources from the driver? Is there 
anything we can do to decrease the number of tasks (partitions) in the first 
stage without merging files before loading?
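For reference, this is roughly what I see (placeholder path):

val df = spark.read.format("avro").load("/data/input/*.avro")  // placeholder path
// with fewer cores than files, this matches the number of input files in my case
println(df.rdd.getNumPartitions)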

Thanks,
Arthur 




-- 
Thanks,
Jason
