val unionDS = rawDS.union(processedDS)
// unionDS.persist(StorageLevel.MEMORY_AND_DISK)
val unionedDS = unionDS.dropDuplicates()
// val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)

// repartition returns a new Dataset rather than mutating the receiver, so the
// result has to be captured; it should also be applied to the deduplicated
// Dataset, which is then registered as the view used by the insert below.
val repartitionedDS = unionedDS.repartition(numPartitions)
repartitionedDS.createOrReplaceTempView("datapoint_prq_union_ds_view")

sparkSession.sql("set hive.exec.dynamic.partition.mode=nonstrict")

val deltaDSQry =
  "insert overwrite table datapoint PARTITION(year, month, day) " +
  "select VIN, utctime, description, descriptionuom, providerdesc, dt_map, " +
  "islocation, latitude, longitude, speed, value, current_date, " +
  "year, month, day from datapoint_prq_union_ds_view"
println(deltaDSQry)
sparkSession.sql(deltaDSQry)
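[Editor's note: one common way to cut the output file count while keeping a high shuffle parallelism is to coalesce to a target derived from the expected output size just before writing. A minimal sketch, assuming the `unionedDS` Dataset from the snippet above; the `OutputFiles` helper, the 128 MB default, and the size estimate are illustrative assumptions, not part of the original code.]

```scala
// Hypothetical helper: pick a coalesce target so each output file lands near
// a desired size (e.g. 128 MB) instead of inheriting all 2000 shuffle
// partitions as separate files.
object OutputFiles {
  def targetPartitions(estimatedOutputMb: Long, desiredFileMb: Long = 128L): Int = {
    require(desiredFileMb > 0, "desired file size must be positive")
    // Round up so the last file is smaller rather than the others larger.
    math.max(1, math.ceil(estimatedOutputMb.toDouble / desiredFileMb).toInt)
  }
}

// Usage sketch (the 20000 MB estimate is invented for illustration):
//   val n = OutputFiles.targetPartitions(estimatedOutputMb = 20000L)
//   unionedDS.coalesce(n).createOrReplaceTempView("datapoint_prq_union_ds_view")
```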
Here is the code, and also the properties used in my project.

On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian....@gmail.com> wrote:

> Can you share some code?
>
> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:
>
>> In my case I am just writing the data frame back to Hive, so when is the
>> best time to repartition it? I did repartition before calling insert
>> overwrite on the table.
>>
>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>>
>>> You have to repartition/coalesce *after* the action that is causing the
>>> shuffle, as that one will take the value you've set.
>>>
>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>
>>>> Yes, I still see a large number of part files: exactly the number I have
>>>> defined in spark.sql.shuffle.partitions.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>>>
>>>> Have you tried caching it and using a coalesce?
>>>>
>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> I tried repartition, but spark.sql.shuffle.partitions is taking
>>>>> precedence over repartition or coalesce. How can I get a smaller number
>>>>> of files with the same performance?
>>>>>
>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
>>>>>
>>>>>> You can also try coalesce, as it will avoid a full shuffle.
>>>>>>
>>>>>> Regards,
>>>>>> Tushar Adeshara
>>>>>> Technical Specialist – Analytics Practice
>>>>>> Cell: +91-81490 04192
>>>>>> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>>>>>>
>>>>>> ------------------------------
>>>>>> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>>> Sent: 13 October 2017 09:35
>>>>>> To: user @spark
>>>>>> Subject: Spark - Partitions
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am reading a Hive query and writing the data back into Hive after
>>>>>> doing some transformations.
>>>>>>
>>>>>> I have changed the setting spark.sql.shuffle.partitions to 2000, and
>>>>>> since then the job completes fast, but the main problem is that I am
>>>>>> getting 2000 files for each partition, with a file size of 10 MB.
>>>>>>
>>>>>> Is there a way to get the same performance but write a smaller number
>>>>>> of files?
>>>>>>
>>>>>> I am trying repartition now, but would like to know if there are any
>>>>>> other options.
>>>>>>
>>>>>> Thanks,
>>>>>> Asmath
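[Editor's note: the "coalesce avoids a full shuffle" point in the thread above can be illustrated without Spark. coalesce merges whole existing partitions into fewer buckets, so rows never need to be redistributed by key the way a shuffle redistributes them. The toy model below is my own illustration, not Spark's actual algorithm; Spark's DefaultPartitionCoalescer additionally weighs data locality when choosing the grouping.]

```scala
// Toy model of coalesce: map each of the old partitions to one of the new
// buckets, so every new partition is a union of whole old partitions.
def coalesceGroups(oldN: Int, newN: Int): Seq[Seq[Int]] = {
  require(oldN > 0 && newN > 0 && newN <= oldN, "can only shrink partition count")
  (0 until oldN)
    .groupBy(i => i.toLong * newN / oldN) // bucket id for each old partition
    .toSeq
    .sortBy(_._1)
    .map(_._2.toSeq)
}

// With 2000 shuffle partitions coalesced to 10, each new partition is the
// concatenation of 200 old ones; no row crosses between buckets by key.
```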
[Attachment: application-datapoint-hdfs-dyn.properties (binary data)]