I want to insert into a partitioned table using dynamic partitioning, but I
don't want to end up with 200 files per partition (the default
spark.sql.shuffle.partitions), because in my case the files would be small.

sqlContext.sql(  """
    |insert overwrite table event
    |partition(eventDate)
    |select
    | user,
    | detail,
    | eventDate
    |from event_wk
  """.stripMargin)

The table "event_wk" is registered from a DataFrame via registerTempTable,
and that DataFrame is built with some joins. If I set
spark.sql.shuffle.partitions=2, the joins' performance will be bad, because
that property applies to every shuffle in the context.
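
And since registerTempTable is lazy, the joins behind "event_wk" only
execute when the insert runs, so even flipping the property just around the
final statement would still run them with 2 partitions. A sketch of the
pattern that does not help:

    // Doesn't help: the joins behind event_wk execute lazily as part of
    // this insert, so they also run with only 2 shuffle partitions.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")
    sqlContext.sql(
      "insert overwrite table event partition(eventDate) " +
      "select user, detail, eventDate from event_wk")
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")  // restore default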

I can do something like this:

event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)

but then I have to handle adding the partitions to the metastore myself.
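
By that I mean something like the following for every eventDate value that
was written; a sketch, where `dates` and `path` are assumed to be known from
the save above:

    // Hypothetical: register each written directory with the metastore
    // so the Hive table picks up the new partitions.
    dates.foreach { d =>
      sqlContext.sql(
        s"alter table event add if not exists partition (eventDate='$d') " +
        s"location '$path/eventDate=$d'")
    }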

Is there a way to control the number of files for just this last insert
step?
