Hi Ryan,
Back to the original topic: is it okay if a Parquet file spans multiple HDFS 
blocks? I ask because when I query the table via Impala, I get a warning along 
the lines of: Parquet file should not be split into multiple hdfs-blocks.
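
(For reference, I could cap the file size at write time with something along 
these lines, where 256 MB is just a placeholder for our dfs.blocksize and the 
table names are made up, but I'd like to understand whether the warning 
actually matters first.)

    -- Impala: keep each Parquet file within a single HDFS block
    SET PARQUET_FILE_SIZE=268435456;  -- assumed 256 MB, matching dfs.blocksize
    INSERT OVERWRITE TABLE events_parquet PARTITION (month)
    SELECT id, payload, month FROM events_rcfile;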

Thanks!
Tianqi

-----Original Message-----
From: Ryan Blue [mailto:[email protected]] 
Sent: Monday, April 13, 2015 2:39 PM
To: [email protected]
Subject: Re: PARQUET_FILE_SIZE & parquet.block.size & dfs.blocksize

On 04/13/2015 02:21 PM, Tianqi Tong wrote:
> Hi Ryan,
> Thanks for the reply!
> The post was very useful for understanding the relationship between Parquet 
> block size and HDFS block size.
> I'm currently migrating an RCFile table to a Parquet table. Right now I'm 
> partitioning by month and by the prefix of a column, which gives me over 
> 500,000 partitions in total. Does having that many partitions hurt performance?
>
> Thank you!
> Tianqi

I'm glad it was helpful.

There's not necessarily anything wrong with 500,000+ partitions, but that 
number alone isn't enough to say whether it will hurt. Partitioning is always 
a trade-off.

You want your partitions to hold enough data to avoid the small-files problem. 
But you also want the partitions to be a good index into the data, so queries 
can skip reading as much of it as possible. In general, I'd make sure each 
partition contains at least a few HDFS blocks' worth of data and that the 
files within it are each at least a full HDFS block.
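
As a rough sketch (the table and column names below are made up, and 256 MB 
just stands in for whatever dfs.blocksize is on your cluster), you can check 
the per-partition sizes and, if most of them come out small, fall back to a 
coarser layout:

    -- list size and file count for each partition of the current table
    SHOW TABLE STATS events_parquet;

    -- if most partitions are well under a few blocks, partitioning by month
    -- alone keeps each partition at several HDFS blocks and lets each file
    -- fill roughly one full block
    SET PARQUET_FILE_SIZE=268435456;  -- assumed 256 MB, matching dfs.blocksize
    CREATE TABLE events_by_month (id BIGINT, payload STRING)
      PARTITIONED BY (month STRING)
      STORED AS PARQUET;
    INSERT OVERWRITE TABLE events_by_month PARTITION (month)
    SELECT id, payload, month FROM events_parquet;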

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.
