Not responding to the real question in the thread, but regarding "I'm using NIFI 1.13.1": please switch to 1.13.2 right away, as 1.13.1 contains a regression.
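For what it's worth, here is a quick back-of-the-envelope check of the bin sizing described below. This is a simplified model using the numbers from the mail (roughly 200,000 records and ~21.5 MB per input file), not NiFi's actual bin-packing implementation:

```python
# Rough simulation of MergeRecord bin packing using the thresholds from the
# mail below. Simplified model only -- not NiFi's actual algorithm.

MB = 1024 * 1024
MIN_RECORDS = 1
MAX_RECORDS = 2_500_000
MIN_BIN_SIZE = 230 * MB
MAX_BIN_SIZE = 256 * MB

FILE_RECORDS = 200_000   # typical records per small parquet file (from the mail)
FILE_SIZE = 21.5 * MB    # typical size of one Snappy-compressed input file

def files_until_sealed():
    """Count how many input files fit before the bin meets its minimums."""
    records = 0
    size = 0.0
    count = 0
    while True:
        # Adding one more file would overflow the bin -> seal without it.
        if records + FILE_RECORDS > MAX_RECORDS or size + FILE_SIZE > MAX_BIN_SIZE:
            return count
        records += FILE_RECORDS
        size += FILE_SIZE
        count += 1
        # Bin now satisfies both minimums -> eligible to merge.
        if size >= MIN_BIN_SIZE and records >= MIN_RECORDS:
            return count

print(files_until_sealed())  # 11, i.e. roughly 236 MB per merged file
```

With those thresholds the bin should seal around 11-12 input files (~236-258 MB), so the observed 19-file merges of 415-417 MB are hard to square with the configured 256 MB maximum bin size.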
On Mon, Mar 22, 2021 at 12:24 AM Vibhath Ileperuma <vibhatharunapr...@gmail.com> wrote:
>
> Hi Bryan,
>
> I'm planning to add these generated parquet files to an Impala S3 table.
> I noticed that Impala-written parquet files contain only one row group.
> That's why I'm trying to write one row group per file.
>
> However, I first tried to create small parquet files (Snappy compressed)
> and then use a MergeRecord processor with a ParquetRecordSetWriter, in
> which the row group size is set to 256 MB, to generate parquet files with
> one row group. The configuration I used:
>
> Merge Strategy: Bin-Packing Algorithm
> Minimum Number of Records: 1
> Maximum Number of Records: 2500000 (2.5 million)
> Minimum Bin Size: 230 MB
> Maximum Bin Size: 256 MB
> Max Bin Age: 20 minutes
>
> Note that the small parquet files mentioned above usually contain 200,000
> records and are about 21-22 MB in size, so about 12 files should be merged
> to generate one file.
>
> But when I run the processor, it always merges 19 files and generates
> files of size 415-417 MB.
>
> I'm using NIFI 1.13.1. Could you please let me know how to resolve this
> issue?
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
> On Fri, Mar 19, 2021 at 8:45 PM Bryan Bende <bbe...@gmail.com> wrote:
>>
>> Hello,
>>
>> What would the reason be to need only one row group per file? Parquet
>> files by design can have many row groups.
>>
>> The ParquetRecordSetWriter won't be able to do this since it is just
>> given an output stream to write all the records to, which happens to
>> be the output stream for one flow file.
>>
>> -Bryan
>>
>> On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
>> <vibhatharunapr...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I'm developing a NiFi flow to convert a set of CSV data to parquet
>> > format and upload it to an S3 bucket.
>> > I use a 'ConvertRecord' processor with a CSV reader and a parquet
>> > record set writer to convert the data, and a 'PutS3Object' processor
>> > to send it to the S3 bucket.
>> >
>> > When converting, I need to make sure the parquet row group size is
>> > 256 MB and that each parquet file contains only one row group. Even
>> > though it is possible to set the row group size in
>> > ParquetRecordSetWriter, I couldn't find a way to make sure each
>> > parquet file contains only one row group (if a CSV file contains more
>> > data than required for a 256 MB row group, multiple parquet files
>> > should be generated).
>> >
>> > I would be grateful if you could suggest a way to do this.
>> >
>> > Thanks & Regards
>> >
>> > Vibhath Ileperuma