Hi Kylin Team,

I am a new user of Apache Kylin and have started exploring it for our MOLAP
requirements.
Thanks so much to the Kylin community for such an offering.

While looking at the way Kylin prepares the offline data, I see the build
steps are:

Create Intermediate Flat Hive Table
Extract Fact Table Distinct Columns
Build Dimension Dictionary
Build Base Cuboid Data
Build N-Dimension Cuboid Data
Build N-Dimension Cuboid Data : N-1 Dimension
Build N-Dimension Cuboid Data : N-2 Dimension
Build N-Dimension Cuboid Data : N-3 Dimension
...
... 
Build N-Dimension Cuboid Data : 0-Dimension
Prepare HFile and BulkLoad into HBase.

Each N-dimension cuboid level is prepared by its own MR job in this sequence,
and the intermediate results are persisted to HDFS and read back by the
subsequent MR job.
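
To picture that chaining for myself, I imagine it roughly like the sketch
below. This is only a hand-wavy illustration using plain Hadoop job chaining,
not Kylin's actual classes; the paths, dimension count, and job names are
made up.

// Rough illustration (generic Hadoop job chaining, not Kylin's real code) of
// how I understand the layered build: each level's job re-reads the previous
// level's cuboid output from HDFS. Paths and names below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LayeredChainSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int n = 4;                                 // number of dimensions (example)
        String prev = "/tmp/cube/base_cuboid";     // output of the base cuboid job
        for (int level = n - 1; level >= 0; level--) {
            Job job = Job.getInstance(conf, "build-" + level + "-dim-cuboid");
            // ... set the real mapper/reducer/output classes here ...
            FileInputFormat.addInputPath(job, new Path(prev));  // re-read from HDFS
            String out = "/tmp/cube/cuboid_level_" + level;
            FileOutputFormat.setOutputPath(job, new Path(out));
            job.waitForCompletion(true);
            prev = out;                            // the next level reads this output
        }
    }
}

That re-reading of the previous level's output at every step is the I/O I am
hoping to avoid.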

I am just thinking of the following approach:
Would it be optimal to combine all of these MR jobs into one single MR job?
In that job we could:
Read the Hive table data only once, as we do now.
As part of the mapper - emit keys for each N-dimension cuboid, with an
identifier field (let's say C_Id) indicating which cuboid the key belongs to
(rough sketch below).
As part of the reducer - aggregate based on the key for each cuboid; keys of
different cuboids are differentiated by the C_Id field.
Prepare the HFile and bulk load all of the aggregated data.
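
A minimal sketch of what I mean, assuming a CSV flat table with a single SUM
measure in the last column and cuboids enumerated as bitmasks over the
dimension columns (none of this is Kylin's actual code, just an illustration
of the C_Id idea):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AllCuboidsSketch {

    // Each cuboid is a bitmask over the dimension columns; with 3 dimensions
    // we get cuboids 0b111 (base cuboid) down to 0b000 (grand total).
    static final int NUM_DIMS = 3;

    public static class CuboidMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");  // dims..., measure last
            long measure = Long.parseLong(cols[NUM_DIMS]);
            // Emit the row once per cuboid, tagging each key with its C_Id.
            for (int cId = 0; cId < (1 << NUM_DIMS); cId++) {
                StringBuilder key = new StringBuilder(Integer.toString(cId));
                for (int d = 0; d < NUM_DIMS; d++) {
                    key.append('\u0001');
                    key.append(((cId >> d) & 1) == 1 ? cols[d] : "*");
                }
                ctx.write(new Text(key.toString()), new LongWritable(measure));
            }
        }
    }

    public static class CuboidReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;                            // SUM measure as an example
            for (LongWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new LongWritable(sum));   // one row per cuboid cell
        }
    }
}

The reducer output for all cuboids could then feed the HFile preparation in
one go.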

Will this help to avoid reading the N-dimension cuboid result as input to the
subsequent MapReduce job that prepares the N-1 dimension cuboid, and so on?
This would have the same amount of sort/shuffle and the same number of reduce
groups in total, but it would save the read I/O that currently feeds each
intermediate mapper.

When we are operating with a larger number of dimensions, this may help us
avoid the multi-stage read I/O for the mappers.

Kindly take a look, and if I am missing anything, please point me to it.

Thanks & Regards,
Ilamparithi M.
