Not responding to the real question in the thread, but regarding "I'm using NIFI 1.13.1": please switch to 1.13.2 right away, as 1.13.1 contains a regression.
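For what it's worth, here is a quick back-of-the-envelope check of the bin sizing described below. This is a simplified model using the numbers from the mail (roughly 200,000 records and ~21.5 MB per input file), not NiFi's actual bin-packing implementation:

```python
# Rough simulation of MergeRecord bin packing using the thresholds from the
# mail below. Simplified model only -- not NiFi's actual algorithm.

MB = 1024 * 1024
MIN_RECORDS = 1
MAX_RECORDS = 2_500_000
MIN_BIN_SIZE = 230 * MB
MAX_BIN_SIZE = 256 * MB

FILE_RECORDS = 200_000   # typical records per small parquet file (from the mail)
FILE_SIZE = 21.5 * MB    # typical size of one Snappy-compressed input file

def files_until_sealed():
    """Count how many input files fit before the bin meets its minimums."""
    records = 0
    size = 0.0
    count = 0
    while True:
        # Adding one more file would overflow the bin -> seal without it.
        if records + FILE_RECORDS > MAX_RECORDS or size + FILE_SIZE > MAX_BIN_SIZE:
            return count
        records += FILE_RECORDS
        size += FILE_SIZE
        count += 1
        # Bin now satisfies both minimums -> eligible to merge.
        if size >= MIN_BIN_SIZE and records >= MIN_RECORDS:
            return count

print(files_until_sealed())  # 11, i.e. roughly 236 MB per merged file
```

With those thresholds the bin should seal around 11-12 input files (~236-258 MB), so the observed 19-file merges of 415-417 MB are hard to square with the configured 256 MB maximum bin size.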
On Mon, Mar 22, 2021 at 12:24 AM Vibhath Ileperuma <vibhatharunapr...@gmail.com> wrote:
>
> Hi Bryan,
>
> I'm planning to add these generated parquet files to an Impala S3 table.
> I noticed that Impala-written parquet files contain only one row group.
> That's why I'm trying to write one row group per file.
>
> However, I first tried to create small parquet files (Snappy compressed)
> and then use a MergeRecord processor with a ParquetRecordSetWriter, in
> which the row group size is set to 256 MB, to generate parquet files with
> one row group. The configuration I used:
>
> Merge Strategy: Bin-Packing Algorithm
> Minimum Number of Records: 1
> Maximum Number of Records: 2500000 (2.5 million)
> Minimum Bin Size: 230 MB
> Maximum Bin Size: 256 MB
> Max Bin Age: 20 minutes
>
> Note that the small parquet files mentioned above usually contain 200,000
> records and are about 21-22 MB in size, so about 12 files should be merged
> to generate one file.
>
> But when I run the processor, it always merges 19 files and generates
> files of size 415-417 MB.
>
> I'm using NIFI 1.13.1. Could you please let me know how to resolve this
> issue?
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
> On Fri, Mar 19, 2021 at 8:45 PM Bryan Bende <bbe...@gmail.com> wrote:
>>
>> Hello,
>>
>> What would the reason be to need only one row group per file? Parquet
>> files by design can have many row groups.
>>
>> The ParquetRecordSetWriter won't be able to do this since it is just
>> given an output stream to write all the records to, which happens to
>> be the output stream for one flow file.
>>
>> -Bryan
>>
>> On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
>> <vibhatharunapr...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I'm developing a NiFi flow to convert a set of CSV data to parquet
>> > format and upload it to an S3 bucket.
>> > I use a 'ConvertRecord' processor with a CSV reader and a parquet
>> > record set writer to convert the data, and a 'PutS3Object' processor
>> > to send it to the S3 bucket.
>> >
>> > When converting, I need to make sure the parquet row group size is
>> > 256 MB and that each parquet file contains only one row group. Even
>> > though it is possible to set the row group size in
>> > ParquetRecordSetWriter, I couldn't find a way to make sure each
>> > parquet file contains only one row group (if a CSV file contains more
>> > data than required for a 256 MB row group, multiple parquet files
>> > should be generated).
>> >
>> > I would be grateful if you could suggest a way to do this.
>> >
>> > Thanks & Regards
>> >
>> > Vibhath Ileperuma