The following should work as long as your tables are created using Spark SQL

event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")
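
For example, a partitioned table created through Spark SQL could look
roughly like this (the column types are just a guess based on the columns
in your query):

sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS event (user STRING, detail STRING)
  PARTITIONED BY (eventDate STRING)
  STORED AS PARQUET
""")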

If you want to stick to using "insert overwrite" for Hive compatibility,
then you can repartition a second time after the join, instead of setting
the global spark.sql.shuffle.partitions parameter:

val eventwk = sqlContext.sql("some joins") // this uses the global shuffle partitions setting
val eventwkRepartitioned = eventwk.repartition(2)
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

and use "event_wk_repartitioned" in your insert statement.
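
For example, re-using the statement from your original mail, just pointed
at the repartitioned temp table:

sqlContext.sql("""
    |insert overwrite table event
    |partition(eventDate)
    |select
    | user,
    | detail,
    | eventDate
    |from event_wk_repartitioned
  """.stripMargin)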

Registering a temp table is cheap.

HTH


On 29 January 2016 at 20:26, Benyi Wang <bewang.t...@gmail.com> wrote:

> I want to insert into a partition table using dynamic partition, but I
> don’t want to have 200 files for a partition because the files will be
> small for my case.
>
> sqlContext.sql(  """
>     |insert overwrite table event
>     |partition(eventDate)
>     |select
>     | user,
>     | detail,
>     | eventDate
>     |from event_wk
>   """.stripMargin)
>
> the table “event_wk” is created from a dataframe by registerTempTable,
> which is built with some joins. If I set spark.sql.shuffle.partitions=2, the
> join’s performance will be bad because that property seems global.
>
> I can do something like this:
>
> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>
> but I have to handle adding partitions by myself.
>
> Is there a way you can control the number of files just for this last
> insert step?
>
