I tried repartition, but spark.sql.shuffle.partitions is taking precedence over repartition or coalesce. How do I get fewer files with the same performance?
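For what it's worth, a minimal sketch of one way to do this (table names, query, and target counts are all hypothetical): leave spark.sql.shuffle.partitions high for the heavy stages and apply repartition() only to the final DataFrame right before the write, so the setting still drives everything upstream:

    import org.apache.spark.sql.SparkSession

    object FewerOutputFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("FewerOutputFiles")
          .enableHiveSupport()
          .getOrCreate()

        // Keep shuffle parallelism high for the expensive transformations.
        spark.conf.set("spark.sql.shuffle.partitions", "2000")

        // Hypothetical query standing in for the real Hive transformations.
        val result = spark.sql(
          "SELECT key, sum(amount) AS total FROM src_db.events GROUP BY key")

        // repartition() adds one extra shuffle, but the aggregation above
        // still runs with 2000 tasks; only the write collapses to 50 files.
        // (coalesce(50) would skip the extra shuffle, but as a narrow
        // dependency it can pull the whole final stage down to 50 tasks,
        // which may be why it looked like the setting "took precedence".)
        result
          .repartition(50) // illustrative target; aim for ~128 MB per file
          .write
          .mode("overwrite")
          .saveAsTable("dst_db.events_agg")

        spark.stop()
      }
    }

The extra shuffle is cheap relative to the transformations as long as the final result is small, which is why this usually keeps the runtime while cutting the file count.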
On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:

> You can also try coalesce, as it will avoid a full shuffle.
>
> Regards,
>
> Tushar Adeshara
> Technical Specialist – Analytics Practice
> Cell: +91-81490 04192
> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>
> ------------------------------
> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
> Sent: 13 October 2017 09:35
> To: user @spark
> Subject: Spark - Partitions
>
> Hi,
>
> I am reading a Hive query and writing the data back into Hive after doing
> some transformations.
>
> I changed the setting spark.sql.shuffle.partitions to 2000, and since then
> the job completes fast. The main problem is that I now get 2000 files for
> each partition, each about 10 MB in size.
>
> Is there a way to get the same performance but write fewer files?
>
> I am trying repartition now, but would like to know if there are any other
> options.
>
> Thanks,
> Asmath
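Expanding on the coalesce suggestion above, a rough sketch of the difference (table and column names are made up): coalesce() merges existing partitions through a narrow dependency with no shuffle, while repartition() does a full shuffle; and when the output table is partitioned, repartitioning by the partition column is another way to cut the file count:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val df = spark.table("src_db.events") // hypothetical source table

    // coalesce: narrow dependency, merges existing partitions, no shuffle.
    val merged = df.coalesce(10)

    // repartition: wide dependency, full shuffle, evenly sized partitions.
    val rebalanced = df.repartition(10)

    // For a table partitioned by event_date, repartitioning by that same
    // column routes each distinct date entirely to one task, so every
    // partition directory gets one file instead of one per shuffle task.
    df.repartition(col("event_date"))
      .write
      .mode("overwrite")
      .partitionBy("event_date")
      .saveAsTable("dst_db.events_by_date")

The trade-off: coalesce is cheaper but can skew task sizes and shrink the parallelism of the stage it lands in, while repartition pays for a shuffle to get evenly sized output files.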