[ https://issues.apache.org/jira/browse/PARQUET-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800515#comment-16800515 ]

Gabor Szadovszky commented on PARQUET-1549:
-------------------------------------------

What is not clear to me in your design is how the different file names/paths 
are generated. The current way of finalizing/padding a row group is based on 
the configuration and driven by the parquet-mr library. How can the act of 
ending the current file and starting a new one be driven by the library if it 
does not know the requested name? Maybe a kind of name generator interface 
could help, but I am not sure whether it would over-complicate the design.
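For illustration only, such a generator could look roughly like the sketch 
below (the interface and its method are hypothetical, nothing like it exists 
in parquet-mr today):

  import org.apache.hadoop.fs.Path;

  // Hypothetical interface letting the client decide the name of each newly
  // started file produced by a single record writer.
  public interface OutputFileNameGenerator {
    // Returns the path of the fileIndex-th file, derived from the path the
    // writer was originally opened with.
    Path nextFile(Path baseFile, int fileIndex);
  }

  // A trivial implementation that appends a running index to the base name:
  class IndexSuffixNameGenerator implements OutputFileNameGenerator {
    @Override
    public Path nextFile(Path baseFile, int fileIndex) {
      String base = baseFile.getName().replaceAll("\\.parquet$", "");
      return new Path(baseFile.getParent(), base + "-" + fileIndex + ".parquet");
    }
  }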

But why would we need this implementation in the first place? Currently, 
parquet-mr handles the row groups (in the different blocks) in parallel 
(processed on different nodes) by using the Hadoop InputSplits. This way it 
does not matter whether a row group is in a separate file or only in a 
separate HDFS block of the same file. If Impala cannot handle the row groups 
similarly then, I think, it is a lack of functionality on the Impala side and 
not on the parquet-mr side.

> Option for one block per file in MapReduce output
> -------------------------------------------------
>
>                 Key: PARQUET-1549
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1549
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Gustavo Figueiredo
>            Priority: Minor
>
> When we create PARQUET files using a MapReduce application with the current 
> ParquetOutputFormat implementation, we don't have any option to reliably 
> limit the number of blocks (row groups) we want to generate per file.
> The implemented configuration option 'parquet.block.size' 
> (ParquetOutputFormat.BLOCK_SIZE) refers to the amount of data that goes into 
> one block, but there are no guarantees that this will be the only block in a 
> file. If one sets this configuration option to a very high value, it's likely 
> there will be a single block per PARQUET file. However, this approach might 
> lead to undesirably big files, so it would not be a good option in some 
> scenarios.
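> For reference, a minimal sketch of setting that option to roughly one HDFS 
> block (128 MB is just an example value) in the job driver:
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.parquet.hadoop.ParquetOutputFormat;
> 
>   Configuration conf = new Configuration();
>   // Ask for ~128 MB row groups. This only bounds the size of each block;
>   // it does not limit how many blocks end up in the same file.
>   conf.setInt(ParquetOutputFormat.BLOCK_SIZE, 128 * 1024 * 1024);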
> This behaviour can't be achieved by the client's 'mapper' either. Although 
> there are some helpful classes in the Hadoop API, such as 'MultipleOutputs', 
> we don't have enough information available in the 'mapper' code in order to 
> have this kind of control, unless one uses unsafe 'hacks' to gather 
> information from private fields.
> For instance, suppose we have an ETL application that loads data from HBASE 
> regions (might be one or more MAPs per region) and produces PARQUET files to 
> be consumed in IMPALA tables (might be one or more PARQUET files per MAP 
> task). To simplify, let's say there is no 'REDUCE' task in this application.
> For concreteness, let's say one could use for such a job 
> 'org.apache.hadoop.hbase.mapreduce.TableInputFormat' as input and 
> 'org.apache.parquet.hadoop.ParquetOutputFormat' as output. 
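> A rough sketch of how such a map-only job could be wired up (the table name, 
> output path and WriteSupport class below are placeholders, and the mapper 
> setup is omitted):
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>   import org.apache.parquet.hadoop.ParquetOutputFormat;
> 
>   Configuration conf = HBaseConfiguration.create();
>   conf.set(TableInputFormat.INPUT_TABLE, "source_table");      // placeholder table
>   Job job = Job.getInstance(conf, "hbase-to-parquet");
>   job.setNumReduceTasks(0);                                    // map-only, as above
>   job.setInputFormatClass(TableInputFormat.class);
>   job.setOutputFormatClass(ParquetOutputFormat.class);
>   ParquetOutputFormat.setWriteSupportClass(job, MyWriteSupport.class); // placeholder
>   ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);
>   FileOutputFormat.setOutputPath(job, new Path("/warehouse/parquet_out")); // placeholder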
> Following the guidelines for maximum query performance of Impala queries in 
> the HADOOP ecosystem, each PARQUET file should be approximately equal in size 
> to an HDFS block and there should be only a single block of data (row group) 
> in each of them (see 
> https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html#perf_cookbook__perf_cookbook_parquet_block_size).
> Currently, we are only able to achieve this by trial and error with different 
> configuration options.
> It would be nice to have a new boolean configuration option (let's call it 
> 'parquet.split.file.per.block') related to the existing one 
> 'parquet.block.size'. If it's set to false (the default value), we would have 
> the current behaviour. If it's set to true, we would have a different PARQUET 
> file generated for each 'block' created, all coming from the same 
> ParquetRecordWriter.
> In doing so, we would only have to worry about tuning the 
> 'parquet.block.size' parameter in order to generate PARQUET files with a 
> single block per file whose size is close to the configured HDFS block size.
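> To make the intent concrete, and continuing the driver sketch above, the 
> tuning would then reduce to something like this (the boolean option below is 
> the proposed one and does not exist yet):
> 
>   Configuration conf = job.getConfiguration();
>   conf.setInt(ParquetOutputFormat.BLOCK_SIZE, 128 * 1024 * 1024);  // ~ one HDFS block
>   conf.setBoolean("parquet.split.file.per.block", true);           // proposed new option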
>  
> In order to implement this new feature, we only need to change a few classes 
> in 'org.apache.parquet.hadoop' package, namely:
>  InternalParquetRecordWriter
>  ParquetFileWriter
>  ParquetOutputFormat
>  ParquetRecordWriter
> Briefly, these are the changes needed:
>  InternalParquetRecordWriter:
>  The field 'ParquetFileWriter parquetFileWriter' should no longer be 'final', 
> since we want to be able to replace it during the task.
>  The method 'checkBlockSizeReached' should call a new method 'startNewFile' 
> just after the call to 'flushRowGroupToStore'.
>  The new method 'startNewFile' should contain all the logic for closing the 
> current file and starting a new one at the same location with a proper 
> filename.
>  
>  ParquetFileWriter
>  The constructor argument 'OutputFile file' should be persisted as a new 
> member field and made available by a new public method. This information is 
> useful for the 'startNewFile' implementation mentioned above.
>  The field 'MessageType schema' should be made available by a new public 
> method. This information is also useful for the 'startNewFile' implementation.
>  
>  ParquetOutputFormat
>  The existing private method 'getMaxPaddingSize' should be made 'public' or 
> at least 'package protected'. This information is useful for the 
> 'startNewFile' implementation mentioned above.
>  The new configuration option 'parquet.split.file.per.block' should be 
> specified here like the other ones. The new behaviour in 
> 'InternalParquetRecordWriter' is conditioned on this configuration option.
>  
>  ParquetRecordWriter
>  Just pass the configuration option along to the internal 
> InternalParquetRecordWriter instance.
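> For illustration of the intended result only (not part of the proposed 
> change): when the application owns the writer itself, i.e. outside of 
> ParquetOutputFormat, a similar one-block-per-file layout can already be 
> approximated with the standalone ParquetWriter by rolling to a new file 
> before a second row group is started. Schema, paths and the roll threshold 
> below are made-up examples:
> 
>   import java.io.IOException;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.parquet.example.data.Group;
>   import org.apache.parquet.example.data.simple.SimpleGroupFactory;
>   import org.apache.parquet.hadoop.ParquetWriter;
>   import org.apache.parquet.hadoop.example.ExampleParquetWriter;
>   import org.apache.parquet.schema.MessageType;
>   import org.apache.parquet.schema.MessageTypeParser;
> 
>   public class RollingParquetWriterSketch {
>     private static final long ROLL_THRESHOLD = 120L * 1024 * 1024; // a bit below 128 MB
> 
>     public static void main(String[] args) throws IOException {
>       Configuration conf = new Configuration();
>       MessageType schema = MessageTypeParser.parseMessageType(
>           "message example { required int32 id; required binary name (UTF8); }");
>       SimpleGroupFactory groups = new SimpleGroupFactory(schema);
> 
>       int fileIndex = 0;
>       ParquetWriter<Group> writer = newWriter(conf, schema, fileIndex++);
>       for (int id = 0; id < 10_000_000; id++) {
>         writer.write(groups.newGroup().append("id", id).append("name", "row-" + id));
>         // Roll to a new file before the current one grows past one row group.
>         if (writer.getDataSize() >= ROLL_THRESHOLD) {
>           writer.close();
>           writer = newWriter(conf, schema, fileIndex++);
>         }
>       }
>       writer.close();
>     }
> 
>     private static ParquetWriter<Group> newWriter(Configuration conf, MessageType schema,
>         int index) throws IOException {
>       return ExampleParquetWriter.builder(new Path("/tmp/parquet_out/part-" + index + ".parquet"))
>           .withConf(conf)
>           .withType(schema)
>           .withRowGroupSize(128 * 1024 * 1024) // same role as parquet.block.size
>           .build();
>     }
>   }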



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
