Re: Single Hdfs block per parquet file
Done, thanks for the feedback: https://issues.apache.org/jira/browse/DRILL-5379

On Thu, Mar 23, 2017 at 4:29 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> [...]
Re: Single Hdfs block per parquet file
This seems like a reasonable feature request. It could also be expanded to detect the underlying block size for the location being written to.

Could you file a JIRA for this?

Thanks

Kunal

From: François Méthot <fmetho...@gmail.com>
Sent: Thursday, March 23, 2017 9:08:51 AM
To: dev@drill.apache.org
Subject: Re: Single Hdfs block per parquet file

> [...]
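A hedged sketch of the detection suggested above (hypothetical code: the target path is made up, and it assumes the writer can reach the standard Hadoop FileSystem API for the destination):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DetectBlockSize {
      public static void main(String[] args) throws Exception {
        // Hypothetical CTAS target location.
        Path target = new Path("/data/output");
        FileSystem fs = target.getFileSystem(new Configuration());
        // Block size HDFS would assign to new files at this location;
        // the writer could adopt it as the parquet block size.
        long hdfsBlockSize = fs.getDefaultBlockSize(target);
        System.out.println("Underlying block size: " + hdfsBlockSize);
      }
    }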
Re: Single Hdfs block per parquet file
After further investigation, Drill uses the Hadoop ParquetFileWriter (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java). This is where the file creation occurs, so it might be tricky after all.

However, ParquetRecordWriter.java in Drill (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java) creates the ParquetFileWriter with a Hadoop Configuration object. So, something to explore: could the block size be set as a property on that Configuration object before it is passed to the ParquetFileWriter constructor?

François

On Wed, Mar 22, 2017 at 11:55 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:
> [...]
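A minimal sketch of that Configuration idea (untested: it assumes HDFS honors the dfs.blocksize key carried in the Configuration when ParquetFileWriter opens the file, and the schema and path here are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import parquet.hadoop.ParquetFileWriter;
    import parquet.schema.MessageType;
    import parquet.schema.MessageTypeParser;

    public class BlockSizeViaConf {
      public static void main(String[] args) throws Exception {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required binary name; }");
        Configuration conf = new Configuration();
        // Assumption: FileSystem.create consults this key, so the file
        // written below would get 128 MB HDFS blocks.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        ParquetFileWriter writer = new ParquetFileWriter(
            conf, schema, new Path("/tmp/example.parquet"));
        writer.start();
        writer.end(new java.util.HashMap<String, String>());
      }
    }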
Re: Single Hdfs block per parquet file
Yes, it seems like it is possible to create files with different block sizes. We could potentially pass the configured store.parquet.block-size to the create call. I will try it out and let you know.

Thanks,
Padma

> On Mar 22, 2017, at 4:16 PM, François Méthot <fmetho...@gmail.com> wrote:
> [...]
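For reference, a sketch of what passing the block size to the create call might look like (made-up path and sizes; the five-argument overload is the one in the links that follow):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
      public static void main(String[] args) throws Exception {
        // Mirrors store.parquet.block-size (128 MB here).
        long parquetBlockSize = 128L * 1024 * 1024;
        Path file = new Path("/data/example.parquet");
        FileSystem fs = file.getFileSystem(new Configuration());
        // The final long argument sets the HDFS block size for this file
        // only, overriding the cluster-wide default.
        FSDataOutputStream out = fs.create(
            file,
            true,                                             // overwrite
            fs.getConf().getInt("io.file.buffer.size", 4096), // buffer size
            fs.getDefaultReplication(file),                   // replication
            parquetBlockSize);                                // block size
        out.close();
      }
    }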
Re: Single Hdfs block per parquet file
Here are 2 links I could find:

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

Francois

On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:
> [...]
Re: Single Hdfs block per parquet file
I think we create one file for each parquet block. If the underlying HDFS block size is 128 MB and the parquet block size is > 128 MB, it will create more blocks on HDFS (for example, a 512 MB parquet block would span four 128 MB HDFS blocks). Can you let me know which HDFS API would allow you to do otherwise?

Thanks,
Padma

> On Mar 22, 2017, at 11:54 AM, François Méthot <fmetho...@gmail.com> wrote:
> [...]
Single Hdfs block per parquet file
Hi,

Is there a way to force Drill to store a CTAS-generated parquet file as a single block when using HDFS? The Java HDFS API allows this: files can be created with the Parquet block size.

We are using Drill on HDFS configured with a block size of 128 MB. Changing this size is not an option at this point.

It would be ideal for us to have a single parquet file per HDFS block. Setting store.parquet.block-size to 128 MB would fix our issue, but we would end up with a lot more files to deal with.

Thanks
Francois