[GitHub] carbondata pull request #2805: [Documentation] Local dictionary Data which a...

sraghunandan Mon, 22 Oct 2018 02:29:59 -0700

Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2805#discussion_r226924568
  
    --- Diff: docs/configuration-parameters.md ---
    @@ -75,7 +75,7 @@ This section provides the details of all the 
configurations required for the Car
     | carbon.use.multiple.temp.dir | false | When multiple disks are present 
in the system, YARN is generally configured with multiple disks to be used as 
temp directories for managing the containers. This configuration specifies 
whether to use multiple YARN local directories during data loading for disk IO 
load balancing.Enable ***carbon.use.local.dir*** for this configuration to take 
effect. **NOTE:** Data Loading is an IO intensive operation whose performance 
can be limited by the disk IO threshold, particularly during multi table 
concurrent data load.Configuring this parameter, balances the disk IO across 
multiple disks there by improving the over all load performance. |
     | carbon.sort.temp.compressor | (none) | CarbonData writes every 
***carbon.sort.size*** number of records to intermediate temp files during data 
loading to ensure memory footprint is within limits. These temporary files can 
be compressed and written in order to save the storage space. This 
configuration specifies the name of compressor to be used to compress the 
intermediate sort temp files during sort procedure in data loading. The valid 
values are 'SNAPPY','GZIP','BZIP2','LZ4','ZSTD' and empty. By default, empty 
means that Carbondata will not compress the sort temp files. **NOTE:** 
Compressor will be useful if you encounter disk bottleneck.Since the data needs 
to be compressed and decompressed,it involves additional CPU cycles,but is 
compensated by the high IO throughput due to less data to be written or read 
from the disks. |
     | carbon.load.skewedDataOptimization.enabled | false | During data 
loading,CarbonData would divide the number of blocks equally so as to ensure 
all executors process same number of blocks. This mechanism satisfies most of 
the scenarios and ensures maximum parallel processing for optimal data loading 
performance.In some business scenarios, there might be scenarios where the size 
of blocks vary significantly and hence some executors would have to do more 
work if they get blocks containing more data. This configuration enables size 
based block allocation strategy for data loading. When loading, carbondata will 
use file size based block allocation strategy for task distribution. It will 
make sure that all the executors process the same size of data.**NOTE:** This 
configuration is useful if the size of your input data files varies widely, say 
1MB to 1GB.For this configuration to work effectively,knowing the data pattern 
and size is important and necessary. |
    -| carbon.load.min.size.enabled | false | During Data Loading, CarbonData 
would divide the number of files among the available executors to parallelize 
the loading operation. When the input data files are very small, this action 
causes to generate many small carbondata files. This configuration determines 
whether to enable node minumun input data size allocation strategy for data 
loading.It will make sure that the node load the minimum amount of data there 
by reducing number of carbondata files.**NOTE:** This configuration is useful 
if the size of the input data files are very small, like 1MB to 256MB.Refer to 
***load_min_size_inmb*** to configure the minimum size to be considered for 
splitting files among executors. |
    +| carbon.load.min.size.enabled | false | During Data Loading, CarbonData 
would divide the number of files among the available executors to parallelize 
the loading operation. When the input data files are very small, this action 
causes to generate many small carbondata files. This configuration determines 
whether to enable node minumun input data size allocation strategy for data 
loading. It will make sure that the nodes load the minimum amount of data there 
by reducing number of carbondata files.**NOTE:** This configuration is useful 
if the size of the input data files are very small, like 1MB to 256MB.Refer to 
***load_min_size_inmb*** to configure the minimum size to be considered for 
splitting files among executors. |
    --- End diff --
    
    Please add a space after each full stop. The added line still has `files.**NOTE:**` and `256MB.Refer to` without one.
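
Missing spaces after full stops, as in the added line of the diff above, can be found mechanically. The following is a minimal sketch, not part of the pull request: the helper names `find_missing_spaces` and `add_missing_spaces` are hypothetical, and the pattern is a heuristic that assumes a sentence ends with a letter or digit and the next one starts with a capital letter or a markdown bold marker. It deliberately leaves dotted property names such as `carbon.sort.size` alone, but it can misfire on abbreviations such as `U.S.A`, so review the results by hand.

```python
import re

# Heuristic: a full stop preceded by a letter or digit and immediately
# followed by an uppercase letter or '*' (start of a markdown bold marker).
# Dotted property names (carbon.sort.size) continue in lowercase, so they
# do not match.
MISSING_SPACE = re.compile(r"(?<=[A-Za-z0-9])\.(?=[A-Z*])")


def find_missing_spaces(text):
    """Return (offset, context) pairs for full stops not followed by a space."""
    hits = []
    for match in MISSING_SPACE.finditer(text):
        start = max(0, match.start() - 15)
        hits.append((match.start(), text[start:match.end() + 15]))
    return hits


def add_missing_spaces(text):
    """Insert a single space after every full stop matched by the heuristic."""
    return MISSING_SPACE.sub(". ", text)


if __name__ == "__main__":
    row = ("reducing number of carbondata files.**NOTE:** This configuration "
           "is useful if the size of the input data files are very small, "
           "like 1MB to 256MB.Refer to ***load_min_size_inmb***")
    for offset, context in find_missing_spaces(row):
        print(offset, repr(context))
    print(add_missing_spaces(row))
```

Running it over each changed table row before pushing would surface the remaining cases in one pass.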
