Thanks. I tried this yesterday and it seems to be working.

On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton <ja...@gluru.co> wrote:
> Hi,
>
> Based on the behaviour I've seen using Parquet, the number of partitions
> in the DataFrame determines the number of files written into each Parquet
> partition directory.
>
> I.e. when you use "PARTITION BY" you're actually partitioning twice: once
> via the partitions Spark has created internally, and again with the
> partitions you specify in the "PARTITION BY" clause.
>
> So if you have 10 partitions in your DataFrame and save it as a Parquet
> file or table partitioned on a column with 3 values, you'll get up to 30
> files in total, up to 10 per Parquet partition.
>
> You can reduce the number of partitions in the DataFrame by calling
> coalesce() before saving the data.
>
> Regards,
>
> James
>
> On 1 March 2016 at 21:01, SRK <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How can I control the number of Parquet files created under a
>> partition? I use sqlContext queries to create a table and insert the
>> records as follows. This creates around 250 Parquet files under each
>> partition, though I was expecting around 2 or 3 files. Due to the
>> large number of files, it takes a lot of time to scan the records. Any
>> suggestions on how to control the number of Parquet files under each
>> partition would be of great help.
>>
>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS testUserDts
>> (userId STRING, savedDate STRING) PARTITIONED BY (partitioner STRING)
>> stored as PARQUET LOCATION '/user/testId/testUserDts' ")
>>
>> sqlContext.sql(
>> """from testUserDtsTemp ps insert overwrite table testUserDts
>> partition(partitioner) select ps.userId, ps.savedDate, ps.partitioner
>> """.stripMargin)
>>
>> Thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-the-number-of-parquet-files-getting-created-under-a-partition-tp26374.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
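A minimal sketch of the coalesce approach James describes, adapted to the queries in the original post (the table names come from the post; the target of 3 partitions is an arbitrary illustrative choice, and this assumes a Spark 1.x sqlContext as used above):

```scala
// Coalesce the source DataFrame to a small number of partitions before the
// insert, so each Hive partition directory receives at most that many
// Parquet files instead of one file per original Spark partition.
val coalesced = sqlContext.table("testUserDtsTemp").coalesce(3)
coalesced.registerTempTable("testUserDtsCoalesced")

sqlContext.sql(
  """from testUserDtsCoalesced ps insert overwrite table testUserDts
    |partition(partitioner) select ps.userId, ps.savedDate, ps.partitioner
  """.stripMargin)
```

As an alternative (on Spark 1.6+), repartitioning by the partition column, e.g. `df.repartition(df("partitioner"))`, groups each partition value's rows into the same task, which tends to produce one file per partition directory rather than one per Spark partition.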