Re: Writing Parquet files with Flink

2016-01-29 Thread Flavio Pompermaier
Hi Fabian, thanks for the response! From what I understand (correct me if I'm wrong), once I produce a Parquet dir that I want to read later, the number of files in the dir affects the initial parallelism of the next job, i.e.: - if I have fewer files than available tasks I will not

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Yes, make both block sizes the same and you're good. I think you can neglect the overhead, unless we are talking about thousands of small files (smaller than the block size).
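Fabian's advice amounts to aligning Parquet's row-group (block) size with the HDFS block size. A minimal sketch of how the two settings could be aligned in a Hadoop configuration; the property names (`dfs.blocksize`, `parquet.block.size`) are the standard HDFS and parquet-mr ones, and the 128 MB value is purely illustrative:

```xml
<!-- Illustrative Hadoop configuration fragment: align the Parquet -->
<!-- block (row-group) size with the HDFS block size. -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB HDFS block size (example value) -->
</property>
<property>
  <name>parquet.block.size</name>
  <value>134217728</value> <!-- Parquet row-group size, matched to HDFS -->
</property>
```

With both values equal, each row group fits exactly one HDFS block, so a reader never has to fetch a row group that straddles two blocks.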

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
The number of input splits does not depend on the number of files but on the number of HDFS blocks across all files. Reading a single file with 100 HDFS blocks and reading 100 files with 1 block each should both be divided into 100 input splits, which can be read by 100 tasks concurrently (or fewer tasks
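Fabian's point (splits follow HDFS blocks, not file count) can be sketched in plain Python. This is not Flink code: `compute_splits` is a hypothetical helper that models only the block-counting logic; a real split computation also considers locality and minimum split sizes.

```python
# Hedged sketch: input splits are derived from HDFS blocks, not files.
BLOCK_SIZE = 128 * 1024 * 1024  # assume a 128 MB HDFS block size

def compute_splits(file_sizes, block_size=BLOCK_SIZE):
    """Return the total number of block-aligned input splits."""
    splits = 0
    for size in file_sizes:
        # each file contributes ceil(size / block_size) splits (min 1)
        splits += max(1, -(-size // block_size))
    return splits

# one file spanning 100 blocks vs. 100 files of one block each:
one_big = [100 * BLOCK_SIZE]
many_small = [BLOCK_SIZE] * 100
assert compute_splits(one_big) == compute_splits(many_small) == 100
```

Both layouts yield the same 100 splits, so either can be read by up to 100 tasks concurrently.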

Re: Writing Parquet files with Flink

2016-01-29 Thread Flavio Pompermaier
So there's no need to worry about the number or size of the Parquet files from the Flink point of view if I correctly set the Parquet block size (equal to the HDFS block size)... It only affects the per-file Parquet overhead (the header and footer present in each file) and the HDFS resources required to handle
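The per-file overhead Flavio mentions can be estimated with a back-of-the-envelope sketch. Parquet files really do start and end with the 4-byte "PAR1" magic, but `FOOTER_BYTES` below is an illustrative assumption (footer size varies with the schema and row-group count), not a Parquet constant:

```python
# Hedged back-of-the-envelope sketch: fixed per-file header/footer
# overhead grows with the file count while the payload stays constant.
MAGIC_BYTES = 4           # "PAR1" magic at the start and end of each file
FOOTER_BYTES = 32 * 1024  # assumed average footer (metadata) size

def overhead_fraction(total_payload_bytes, num_files):
    """Fraction of written bytes spent on per-file header/footer overhead."""
    per_file = 2 * MAGIC_BYTES + FOOTER_BYTES
    overhead = num_files * per_file
    return overhead / (total_payload_bytes + overhead)

# 10 GB of payload: 100 files vs. 10,000 files
few = overhead_fraction(10 * 2**30, 100)
many = overhead_fraction(10 * 2**30, 10_000)
assert few < many  # more files => proportionally more overhead
```

For a handful of large files the fraction is negligible, which matches Fabian's reply that the overhead only matters once you reach thousands of small files.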

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Hi Flavio, using a default FileOutputFormat, Flink writes one output file for each data sink task, i.e., as many files as the defined parallelism. The size of these files depends on the total output size and the distribution. If you write to HDFS, a file consists of one or more HDFS blocks.

Writing Parquet files with Flink

2016-01-28 Thread Flavio Pompermaier
Hi to all, I was reading about optimal Parquet file size and HDFS block size. The ideal situation for Parquet is when its block size (and thus the maximum size of each row group) is equal to the HDFS block size. The default behaviour of Flink is that the output file's size depends on the output