val unionDS = rawDS.union(processedDS)
// unionDS.persist(StorageLevel.MEMORY_AND_DISK)
val unionedDS = unionDS.dropDuplicates()
// val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)

// repartition returns a new Dataset rather than mutating the receiver, so the
// result has to be captured; it should also be applied to the deduplicated
// Dataset, which is then registered as the view used by the insert below.
val repartitionedDS = unionedDS.repartition(numPartitions)
repartitionedDS.createOrReplaceTempView("datapoint_prq_union_ds_view")

sparkSession.sql("set hive.exec.dynamic.partition.mode=nonstrict")

val deltaDSQry =
  "insert overwrite table datapoint PARTITION(year, month, day) " +
  "select VIN, utctime, description, descriptionuom, providerdesc, dt_map, " +
  "islocation, latitude, longitude, speed, value, current_date, " +
  "year, month, day from datapoint_prq_union_ds_view"
println(deltaDSQry)
sparkSession.sql(deltaDSQry)
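[Editor's note: one common way to cut the output file count while keeping a high shuffle parallelism is to coalesce to a target derived from the expected output size just before writing. A minimal sketch, assuming the `unionedDS` Dataset from the snippet above; the `OutputFiles` helper, the 128 MB default, and the size estimate are illustrative assumptions, not part of the original code.]

```scala
// Hypothetical helper: pick a coalesce target so each output file lands near
// a desired size (e.g. 128 MB) instead of inheriting all 2000 shuffle
// partitions as separate files.
object OutputFiles {
  def targetPartitions(estimatedOutputMb: Long, desiredFileMb: Long = 128L): Int = {
    require(desiredFileMb > 0, "desired file size must be positive")
    // Round up so the last file is smaller rather than the others larger.
    math.max(1, math.ceil(estimatedOutputMb.toDouble / desiredFileMb).toInt)
  }
}

// Usage sketch (the 20000 MB estimate is invented for illustration):
//   val n = OutputFiles.targetPartitions(estimatedOutputMb = 20000L)
//   unionedDS.coalesce(n).createOrReplaceTempView("datapoint_prq_union_ds_view")
```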
Here is the code, and also the properties used in my project.

On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian....@gmail.com> wrote:

> Can you share some code?
>
> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:
>
>> In my case I am just writing the data frame back to Hive, so when is the
>> best time to repartition it? I did repartition before calling insert
>> overwrite on the table.
>>
>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>>
>>> You have to repartition/coalesce *after* the action that is causing the
>>> shuffle, as that one will take the value you've set.
>>>
>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>
>>>> Yes, I still see a large number of part files: exactly the number I have
>>>> defined in spark.sql.shuffle.partitions.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>>>
>>>> Have you tried caching it and using a coalesce?
>>>>
>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> I tried repartition, but spark.sql.shuffle.partitions is taking
>>>>> precedence over repartition or coalesce. How can I get a smaller number
>>>>> of files with the same performance?
>>>>>
>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
>>>>>
>>>>>> You can also try coalesce, as it will avoid a full shuffle.
>>>>>>
>>>>>> Regards,
>>>>>> Tushar Adeshara
>>>>>> Technical Specialist – Analytics Practice
>>>>>> Cell: +91-81490 04192
>>>>>> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>>>>>>
>>>>>> ------------------------------
>>>>>> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>>> Sent: 13 October 2017 09:35
>>>>>> To: user @spark
>>>>>> Subject: Spark - Partitions
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am reading a Hive query and writing the data back into Hive after
>>>>>> doing some transformations.
>>>>>>
>>>>>> I have changed the setting spark.sql.shuffle.partitions to 2000, and
>>>>>> since then the job completes fast, but the main problem is that I am
>>>>>> getting 2000 files for each partition, with a file size of 10 MB.
>>>>>>
>>>>>> Is there a way to get the same performance but write a smaller number
>>>>>> of files?
>>>>>>
>>>>>> I am trying repartition now, but would like to know if there are any
>>>>>> other options.
>>>>>>
>>>>>> Thanks,
>>>>>> Asmath
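[Editor's note: the "coalesce avoids a full shuffle" point in the thread above can be illustrated without Spark. coalesce merges whole existing partitions into fewer buckets, so rows never need to be redistributed by key the way a shuffle redistributes them. The toy model below is my own illustration, not Spark's actual algorithm; Spark's DefaultPartitionCoalescer additionally weighs data locality when choosing the grouping.]

```scala
// Toy model of coalesce: map each of the old partitions to one of the new
// buckets, so every new partition is a union of whole old partitions.
def coalesceGroups(oldN: Int, newN: Int): Seq[Seq[Int]] = {
  require(oldN > 0 && newN > 0 && newN <= oldN, "can only shrink partition count")
  (0 until oldN)
    .groupBy(i => i.toLong * newN / oldN) // bucket id for each old partition
    .toSeq
    .sortBy(_._1)
    .map(_._2.toSeq)
}

// With 2000 shuffle partitions coalesced to 10, each new partition is the
// concatenation of 200 old ones; no row crosses between buckets by key.
```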
[Attachment: application-datapoint-hdfs-dyn.properties (binary data)]