Can you share some code?

On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:
> In my case I am just writing the data frame back to Hive, so when is the
> best time to repartition it? I did repartition before calling insert
> overwrite on the table.
>
> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>
>> You have to repartition/coalesce *after* the action that is causing the
>> shuffle, as that one will take the value you've set.
>>
>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>
>>> Yes, I still see a larger number of part files, exactly the number I
>>> have defined in spark.sql.shuffle.partitions.
>>>
>>> Sent from my iPhone
>>>
>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>>
>>> Have you tried caching it and using a coalesce?
>>>
>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>>
>>>> I tried repartition, but spark.sql.shuffle.partitions is taking
>>>> precedence over repartition or coalesce. How do I get a smaller number
>>>> of files with the same performance?
>>>>
>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
>>>>
>>>>> You can also try coalesce, as it will avoid a full shuffle.
>>>>>
>>>>> Regards,
>>>>> Tushar Adeshara
>>>>> Technical Specialist – Analytics Practice
>>>>> Cell: +91-81490 04192
>>>>> Persistent Systems Ltd. | www.persistentsys.com
>>>>>
>>>>> ------------------------------
>>>>> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>> Sent: 13 October 2017 09:35
>>>>> To: user @spark
>>>>> Subject: Spark - Partitions
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am reading a Hive query and writing the data back into Hive after
>>>>> doing some transformations.
>>>>>
>>>>> I have changed the setting spark.sql.shuffle.partitions to 2000, and
>>>>> since then the job completes fast, but the main problem is that I am
>>>>> getting 2000 files for each partition, and the size of each file is
>>>>> 10 MB.
>>>>>
>>>>> Is there a way to get the same performance but write a smaller number
>>>>> of files?
>>>>>
>>>>> I am trying repartition now, but would like to know if there are any
>>>>> other options.
>>>>>
>>>>> Thanks,
>>>>> Asmath
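[Editor's note for readers landing on this thread later: the behaviour discussed above is that spark.sql.shuffle.partitions controls how many partitions shuffles (joins, aggregations) produce, and on write each output partition becomes one file. Sebastian's advice is to narrow the final DataFrame with coalesce/repartition after all shuffling transformations, immediately before the write. A minimal Scala sketch of that pattern follows; the table names, column names, and the target file count of 50 are hypothetical, not from the thread.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-before-write")
  .enableHiveSupport()
  .getOrCreate()

// Keep high shuffle parallelism so the expensive transformations stay fast.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// Hypothetical transformation: the aggregation shuffles into 2000 partitions.
val transformed = spark.table("source_db.events")
  .groupBy("customer_id")
  .count()

// Coalesce only the final result: 2000 partitions are narrowed to ~50
// without another full shuffle, so roughly 50 output files are written
// instead of 2000.
transformed
  .coalesce(50)
  .write
  .mode("overwrite")
  .insertInto("target_db.events_summary")
```

One caveat worth knowing: because coalesce is a narrow transformation, Spark may fold it into the preceding stage and run that stage with only 50 tasks, losing the parallelism you paid for. If that happens, either cache the DataFrame before coalescing (Michael's suggestion above) or use repartition(50), which forces an extra shuffle but keeps the upstream work at full parallelism.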