Hi Ketan,
Here is what I found:
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/cuboid
- This dir contains the cuboid data; each row holds a dimensions array and
a MeasureAggregator array.
- Its size depends on the cardinality of each column, and it is often
very large.
- When a merge job completes, the cuboid files of all successfully merged
segments are deleted automatically.
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/fact_distinct_columns
- This dir contains the distinct values of each column.
- It can be deleted after the current segment build job succeeds.
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/hfile
- This dir contains the data files that are bulk-loaded into HBase.
- It can be deleted after the current segment build job succeeds.
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/rowkey_stats
- Files under this dir are usually very small; you may not need to delete
them yourself.
- These files are used to partition the HFiles.
I also suggest updating your auto-merge settings so that merges happen more
often. If you find any mistakes, please let me know. Thank you!
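If you do decide to reclaim the space by hand, a rough sketch of the kind of
HDFS commands involved might look like this. The paths and variable names
below are placeholders, not your real values, and every command is only
echoed (a dry run) so nothing is deleted until you review it:

```shell
# Sketch only: inspect and clean the per-job dirs under the Kylin HDFS
# working dir. WORKING_DIR and JOB_PATH are placeholders -- substitute
# your real <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name> path.
WORKING_DIR="/kylin"                                   # placeholder
JOB_PATH="${WORKING_DIR}/metadata-name/job-id/cube-name"   # placeholder

# Echo each command instead of executing it, so this stays a dry run.
run() { echo "+ $*"; }

# 1. Check how much space each subdirectory of the working dir uses.
run hdfs dfs -du -h "${WORKING_DIR}"

# 2. Once the segment build (and any merge) has succeeded, the large
#    intermediate dirs can be removed; rowkey_stats is tiny and can stay.
for d in cuboid fact_distinct_columns hfile; do
  run hdfs dfs -rm -r -skipTrash "${JOB_PATH}/${d}"
done
```

Only swap the echo in run() for real execution after double-checking the job
has fully finished; running org.apache.kylin.tool.StorageCleanupJob first is
the safer option where it applies.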
----------------
Best wishes,
Xiaoxiang Yu
On [DATE], "[NAME]" <[ADDRESS]> wrote:
Hi team,
Any updates on the same ?
Thanks,
Ketan
> On 01-Feb-2019, at 11:39 AM, ketan dikshit <[email protected]> wrote:
>
> Hi Team,
>
> We have a lot of data accumulated in our hdfs-working-directory, so we
want to understand the purpose of the following job data once a job has
completed and its segment has been successfully created.
>
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/cuboid
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/fact_distinct_columns
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/hfile
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/rowkey_stats
>
> Basically, I need to understand the purpose of cuboid,
fact_distinct_columns, hfile, and rowkey_stats after the job has built the
cube segment (assuming we don't use any merging/auto-merging of segments on
the cube later).
>
> The space taken up by this data in the hdfs-working-dir is quite
large (affecting our costs), and it is not being cleaned up by the cleanup
job (org.apache.kylin.tool.StorageCleanupJob). So we need to understand
whether we can manually clean this up without running into issues later.
>
> Thanks,
> Ketan@Exponential