Thanks. I tried this yesterday and it seems to be working.

On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton <ja...@gluru.co> wrote:
> Hi,
>
> Based on the behaviour I've seen using Parquet, the number of partitions
> in the DataFrame determines the number of files written into each Parquet
> partition directory.
>
> I.e. when you use "PARTITION BY" you're actually partitioning twice: once
> via the partitions Spark has created internally, and again with the
> partitions you specify in the "PARTITION BY" clause.
>
> So if you have 10 partitions in your DataFrame and save it as a Parquet
> file or table partitioned on a column with 3 values, you'll get up to 30
> files in total, up to 10 per Parquet partition.
>
> You can reduce the number of partitions in the DataFrame by calling
> coalesce() before saving the data.
>
> Regards,
>
> James
>
> On 1 March 2016 at 21:01, SRK <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How can I control the number of Parquet files created under a
>> partition? I use sqlContext queries to create a table and insert the
>> records as follows. This creates around 250 Parquet files under each
>> partition, though I was expecting around 2 or 3 files. Due to the
>> large number of files, it takes a lot of time to scan the records. Any
>> suggestions on how to control the number of Parquet files under each
>> partition would be of great help.
>>
>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS testUserDts
>> (userId STRING, savedDate STRING) PARTITIONED BY (partitioner STRING)
>> stored as PARQUET LOCATION '/user/testId/testUserDts' ")
>>
>> sqlContext.sql(
>> """from testUserDtsTemp ps insert overwrite table testUserDts
>> partition(partitioner) select ps.userId, ps.savedDate, ps.partitioner
>> """.stripMargin)
>>
>> Thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-the-number-of-parquet-files-getting-created-under-a-partition-tp26374.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
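A minimal sketch of the coalesce approach James describes, adapted to the queries in the original post (the table names come from the post; the target of 3 partitions is an arbitrary illustrative choice, and this assumes a Spark 1.x sqlContext as used above):

```scala
// Coalesce the source DataFrame to a small number of partitions before the
// insert, so each Hive partition directory receives at most that many
// Parquet files instead of one file per original Spark partition.
val coalesced = sqlContext.table("testUserDtsTemp").coalesce(3)
coalesced.registerTempTable("testUserDtsCoalesced")

sqlContext.sql(
  """from testUserDtsCoalesced ps insert overwrite table testUserDts
    |partition(partitioner) select ps.userId, ps.savedDate, ps.partitioner
  """.stripMargin)
```

As an alternative (on Spark 1.6+), repartitioning by the partition column, e.g. `df.repartition(df("partitioner"))`, groups each partition value's rows into the same task, which tends to produce one file per partition directory rather than one per Spark partition.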