Thanks Deenar, both methods work. I actually tried the second method in spark-shell earlier, but it didn't work at that time. The reason might be that I registered the DataFrame eventwk as a temporary table, repartitioned it, and then registered the table again. Unfortunately I could not reproduce the problem. Thanks again.
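In case it helps anyone reading the archive later, here is a fuller sketch of the second method as I understand it. The join query is a placeholder, the table/column names (event, event_wk, eventDate, user, detail) are the ones from this thread, and the two Hive settings are what dynamic-partition inserts typically require:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
// dynamic-partition inserts usually need these Hive settings
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// The joins still run with the default spark.sql.shuffle.partitions
// (200), so join performance is unaffected.
val eventwk = sqlContext.sql("...") // placeholder for the actual join query

// Repartition only the final result, then expose it as a temp table.
val eventwkRepartitioned = eventwk.repartition(2)
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

// The insert now reads from 2 partitions, so it writes at most 2 files
// per eventDate partition.
sqlContext.sql(
  """
    |insert overwrite table event
    |partition(eventDate)
    |select user, detail, eventDate
    |from event_wk_repartitioned
  """.stripMargin)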
On Sat, Jan 30, 2016 at 1:25 AM, Deenar Toraskar <deenar.toras...@gmail.com> wrote:

> The following should work as long as your tables are created using Spark
> SQL:
>
> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")
>
> If you want to stick to using "insert overwrite" for Hive compatibility,
> then you can repartition twice instead of setting the global
> spark.sql.shuffle.partitions parameter:
>
> val eventwk = sqlContext.sql("some joins") // this uses the global
> shuffle partition parameter
> val eventwkRepartitioned = eventwk.repartition(2)
> eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
>
> and use this temp table in your insert statement. Registering a temp
> table is cheap.
>
> HTH
>
> On 29 January 2016 at 20:26, Benyi Wang <bewang.t...@gmail.com> wrote:
>
>> I want to insert into a partitioned table using dynamic partitioning,
>> but I don’t want to end up with 200 files per partition because the
>> files will be small in my case.
>>
>> sqlContext.sql(
>>   """
>>     |insert overwrite table event
>>     |partition(eventDate)
>>     |select
>>     |  user,
>>     |  detail,
>>     |  eventDate
>>     |from event_wk
>>   """.stripMargin)
>>
>> The table “event_wk” is created from a DataFrame by registerTempTable,
>> and that DataFrame is built with some joins. If I set
>> spark.sql.shuffle.partitions=2, the joins’ performance will be bad
>> because that property is global.
>>
>> I can do something like this:
>>
>> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>>
>> but then I have to handle adding partitions myself.
>>
>> Is there a way to control the number of files just for this last
>> insert step?
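P.S. For completeness, a minimal sketch of the first method from Deenar's reply, assuming the "event" table was created through Spark SQL and is partitioned by eventDate. One caveat: newer Spark releases reject combining partitionBy with insertInto, since insertInto takes the partitioning from the existing table definition, so partitionBy is omitted here; mode("overwrite") is my assumption for getting insert-overwrite semantics.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val event_wk = sqlContext.table("event_wk") // the temp table built from the joins

// insertInto matches columns by position against the target table's
// schema and takes the partitioning (eventDate) from the table
// definition. repartition(2) means 2 tasks write the output, so at
// most 2 files land in each eventDate partition.
event_wk
  .repartition(2)
  .write
  .mode("overwrite")
  .insertInto("event")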