Have you tried caching it and using coalesce()?
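For the cache-plus-coalesce route, here is a minimal sketch. The DataFrame name, table name, and the 128 MB target file size are assumptions for illustration, not from the thread:

```python
# Sketch: cache the transformed DataFrame so it is not recomputed, then
# coalesce to a smaller partition count just before writing. The expensive
# transformations still run with spark.sql.shuffle.partitions = 2000;
# only the final write uses fewer partitions (and hence fewer files).

def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Pick an output partition count aiming for ~target_file_bytes per file."""
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# With Spark (illustrative; `df`, `total_input_bytes`, and the table name
# are assumptions):
# df.cache()
# n = target_partitions(total_input_bytes)
# df.coalesce(n).write.mode("overwrite").saveAsTable("db.output_table")
```

For the numbers in this thread (2000 files of ~10 MB, i.e. ~20 GB per partition), this sizing rule would suggest on the order of 150-160 output files instead of 2000.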


On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
wrote:

> I tried repartition, but spark.sql.shuffle.partitions is taking
> precedence over repartition or coalesce. How can I get a smaller number
> of files with the same performance?
>
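One common reason repartition appears to be overridden: if a wide transformation (join, groupBy, etc.) runs *after* the repartition, it shuffles the data again, back to spark.sql.shuffle.partitions partitions. Placing repartition as the very last step before the write avoids this. A toy model of that behavior (the step names are illustrative, not Spark API):

```python
# Toy model: the output file count is set by the LAST partitioning step.
# A repartition followed by a wide op is "undone" by that op's shuffle,
# which falls back to spark.sql.shuffle.partitions.

def partitions_after(plan, shuffle_partitions=2000):
    """`plan` is a list of steps like ('repartition', 200) or ('join', None).
    Returns the partition count the final write would see."""
    n = shuffle_partitions
    for step, k in plan:
        if step == 'repartition':
            n = k                      # explicit count wins...
        elif step in ('join', 'groupBy'):
            n = shuffle_partitions     # ...until the next wide shuffle
    return n

# repartition too early: the join shuffles back to 2000 partitions.
# repartition last: the write produces 200 files.
```

In PySpark terms, the sketch corresponds to writing `result.repartition(200).write.insertInto(...)` only after all joins and aggregations are done.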
> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
> tushar_adesh...@persistent.com> wrote:
>
>> You can also try coalesce, as it avoids a full shuffle.
>>
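coalesce(n) merges existing parent partitions into n groups rather than shuffling every row, which is why it is cheaper than repartition(n); the trade-off is that merged output files can be uneven if the parents are. A toy model of the merge (contiguous grouping here; Spark's actual assignment is locality-aware):

```python
# Toy model of coalesce(n): parent partitions are grouped into n child
# partitions; each parent's data stays whole (no row-level shuffle).
# Contiguous chunking is a simplification of Spark's real grouping.

def coalesce_sizes(parent_sizes, n):
    """Return the sizes of the n merged partitions."""
    groups = [0] * n
    for i, size in enumerate(parent_sizes):
        groups[i * n // len(parent_sizes)] += size
    return groups

# 2000 parents of 10 MB each, coalesced to 200 -> 200 files of ~100 MB.
```

Because no shuffle runs, coalesce keeps the upstream parallelism intact; repartition(n) would redistribute rows evenly but at the cost of a full shuffle.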
>>
>> Regards,
>> Tushar Adeshara
>>
>> ------------------------------
>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>> *Sent:* 13 October 2017 09:35
>> *To:* user @spark
>> *Subject:* Spark - Partitions
>>
>> Hi,
>>
>> I am reading data with a Hive query and writing it back into Hive after
>> doing some transformations.
>>
>> I changed spark.sql.shuffle.partitions to 2000, and since then the job
>> completes fast, but the main problem is that I am getting 2000 files for
>> each partition, each about 10 MB in size.
>>
>> Is there a way to get the same performance but write fewer files?
>>
>> I am trying repartition now but would like to know if there are any other
>> options.
>>
>> Thanks,
>> Asmath
>>
>
>
