Hi Ketan,
Here is what I found:
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/cuboid
- This dir contains the cuboid data; each row holds a dimensions array and
a MeasureAggregator array.
- Its size depends on the cardinality of each column, and it is often
very large.
- When a merge job completes, the cuboid files of all successfully merged
segments are deleted automatically.
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/fact_distinct_columns
- This dir contains the distinct values of each column.
- It can be deleted after the current segment build job succeeds.
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/hfile
- This dir contains the data files that are bulk-loaded into HBase.
- It can be deleted after the current segment build job succeeds.
- <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/rowkey_stats
- Files under this dir are usually very small; you may not need to delete
them yourself.
- These files are used to partition the HFiles.
I also suggest updating your auto-merge settings so that merges happen more
often. If you find any mistakes, please let me know. Thank you!
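If you do decide to reclaim the space by hand, a rough sketch of the kind of
HDFS commands involved might look like this. The paths and variable names
below are placeholders, not your real values, and every command is only
echoed (a dry run) so nothing is deleted until you review it:

```shell
# Sketch only: inspect and clean the per-job dirs under the Kylin HDFS
# working dir. WORKING_DIR and JOB_PATH are placeholders -- substitute
# your real <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name> path.
WORKING_DIR="/kylin"                                   # placeholder
JOB_PATH="${WORKING_DIR}/metadata-name/job-id/cube-name"   # placeholder

# Echo each command instead of executing it, so this stays a dry run.
run() { echo "+ $*"; }

# 1. Check how much space each subdirectory of the working dir uses.
run hdfs dfs -du -h "${WORKING_DIR}"

# 2. Once the segment build (and any merge) has succeeded, the large
#    intermediate dirs can be removed; rowkey_stats is tiny and can stay.
for d in cuboid fact_distinct_columns hfile; do
  run hdfs dfs -rm -r -skipTrash "${JOB_PATH}/${d}"
done
```

Only swap the echo in run() for real execution after double-checking the job
has fully finished; running org.apache.kylin.tool.StorageCleanupJob first is
the safer option where it applies.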
----------------
Best wishes,
Xiaoxiang Yu
On [DATE], "[NAME]" <[ADDRESS]> wrote:
Hi team,
Any updates on the same ?
Thanks,
Ketan
> On 01-Feb-2019, at 11:39 AM, ketan dikshit <[email protected]> wrote:
>
> Hi Team,
>
> We have a lot of data accumulated in our hdfs-working-directory, so we
want to understand the purpose of the following job data once a job has
completed and its segment has been successfully created.
>
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/cuboid
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/fact_distinct_columns
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/hfile
> <hdfs-working-dir>/<metadata-name>/<job-id>/<cube-name>/rowkey_stats
>
> Basically, I need to understand the purpose of cuboid,
fact_distinct_columns, hfile, and rowkey_stats after the job has built the
cube segment (assuming we don't use any merging/auto-merging of segments on
the cube later).
>
> The space taken up by this data in the hdfs-working-dir is quite
large (affecting our costs), and it is not being cleaned up by the cleanup
job (org.apache.kylin.tool.StorageCleanupJob). So we need to understand
whether we can manually clean this up without running into issues later.
>
> Thanks,
> Ketan@Exponential