Have you tried caching it and using a coalesce?
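Coalesce merges existing partitions rather than redistributing every row, which is why it avoids a full shuffle. A toy model of that behavior in plain Python (the function name and grouping scheme are illustrative only, not Spark's actual implementation):

```python
def coalesce(partitions, num_target):
    """Merge existing partitions into num_target groups without
    redistributing individual rows (i.e., no full shuffle)."""
    merged = [[] for _ in range(num_target)]
    for i, part in enumerate(partitions):
        # Each source partition is appended wholesale to one target group;
        # rows never move independently of their partition.
        merged[i % num_target].extend(part)
    return merged

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))  # [[1, 2, 4, 5], [3, 6]]
```

Because whole partitions are glued together, coalesce is cheap, but it can also produce skewed output groups; repartition does a full shuffle and balances rows evenly at the cost of moving all the data.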
On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:

> I tried repartition, but spark.sql.shuffle.partitions takes precedence
> over repartition and coalesce. How can I get a smaller number of files
> with the same performance?
>
> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara
> <tushar_adesh...@persistent.com> wrote:
>
>> You can also try coalesce, as it avoids a full shuffle.
>>
>> Regards,
>>
>> *Tushar Adeshara*
>> *Technical Specialist – Analytics Practice*
>> *Cell: +91-81490 04192*
>> *Persistent Systems Ltd. | Partners in Innovation |
>> www.persistentsys.com*
>>
>> ------------------------------
>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>> *Sent:* 13 October 2017 09:35
>> *To:* user @spark
>> *Subject:* Spark - Partitions
>>
>> Hi,
>>
>> I am reading a Hive query and writing the data back into Hive after
>> doing some transformations.
>>
>> I changed spark.sql.shuffle.partitions to 2000, and since then the job
>> completes quickly, but the main problem is that I now get 2000 files
>> per partition, each about 10 MB in size.
>>
>> Is there a way to keep the same performance but write fewer files?
>>
>> I am trying repartition now, but would like to know if there are any
>> other options.
>>
>> Thanks,
>> Asmath
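Putting the thread's advice together: keep spark.sql.shuffle.partitions high so the transformations stay parallel, then coalesce just before the write so each output file lands near a sensible size (128 MB is a common HDFS-friendly target). A sketch using the 2000 × 10 MB figures from the thread; the DataFrame calls are shown as comments since they need a live SparkSession, and `df` and the table name are hypothetical:

```python
def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output partitions so each file is ~target_file_bytes."""
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# 2000 files x 10 MB = ~20 GB of output.
n = target_partitions(2000 * 10 * 1024 * 1024)
print(n)  # 157 files of ~128 MB instead of 2000 files of 10 MB

# In the Spark job itself (hypothetical DataFrame `df` and table name):
#   df.coalesce(n).write.insertInto("db.table")
# Calling coalesce() only at the end leaves the 2000-way shuffle
# parallelism intact during the transformations and merges the output
# partitions just before writing.
```

Note that repartition(n) would also reduce the file count, but it triggers a full shuffle; coalesce is the cheaper option when only the write-side partition count needs to shrink.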