Hi Kylin Team,

I am a new user of Apache Kylin, exploring it for our MOLAP requirements. Many thanks to the Kylin community for such an offering.
While looking at how Kylin prepares the offline data, I see the following sequence of steps:

  1. Create Intermediate Flat Hive Table
  2. Extract Fact Table Distinct Columns
  3. Build Dimension Dictionary
  4. Build Base Cuboid Data
  5. Build N-Dimension Cuboid Data : N-1 Dimension
  6. Build N-Dimension Cuboid Data : N-2 Dimension
  7. Build N-Dimension Cuboid Data : N-3 Dimension
     ...
     Build N-Dimension Cuboid Data : 0-Dimension
  8. Prepare HFile and BulkLoad into HBase

Each N-dimension cuboid is prepared by a separate MR job in this sequence; intermediate results are persisted to HDFS and read back by the subsequent MR job.

I am considering the following approach: would it be more efficient to combine all of these MR jobs into a single one? We would:

  - Read the Hive table data only once, as we do now.
  - In the mapper, emit keys for each N-dimension cuboid, with an identifier field (say, C_Id) indicating which cuboid each key belongs to.
  - In the reducer, aggregate by key for each cuboid, distinguishing the keys of different cuboids by the C_Id field.
  - Prepare the HFiles and bulk-load all the aggregated data.

This would avoid reading the N-dimension cuboid output as the mapper input of the subsequent MR job that prepares the (N-1)-dimension cuboid, and so on. The total amount of sort/shuffle work and the number of reduce groups would stay the same, but we would save the read I/O that feeds each intermediate mapper. When operating with a larger number of dimensions, this could avoid multiple stages of mapper read I/O.

Please take a look, and if I am missing anything, kindly point it out.

Thanks & Regards,
Ilamparithi M.

--
View this message in context: http://apache-kylin.74782.x6.nabble.com/N-Cuboids-preparation-MapReduce-Trying-to-avoid-multiple-stage-read-tp3528.html
Sent from the Apache Kylin mailing list archive at Nabble.com.
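P.S. To make the proposal concrete, here is a minimal, hypothetical sketch (plain Java, no Hadoop dependencies; the class, method names, and the "*" placeholder for projected-out dimensions are my own illustration, not Kylin code) of the single-pass idea: the map side expands one input row into one key per cuboid, prefixing each key with a C_Id, so a single shuffle/reduce can aggregate every cuboid at once.

```java
import java.util.*;

/** Hypothetical sketch of the proposed single-job cuboid build. */
public class CombinedCuboidSketch {

    /** Map side: for one input row, emit (C_Id + projected dimensions) -> measure
     *  once per cuboid. A cuboid is described by a boolean mask over dimensions;
     *  dimensions outside the cuboid are replaced by "*". */
    static List<Map.Entry<String, Long>> map(String[] dims, long measure,
                                             List<boolean[]> cuboidMasks) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (int cId = 0; cId < cuboidMasks.size(); cId++) {
            boolean[] mask = cuboidMasks.get(cId);
            StringBuilder key = new StringBuilder(String.valueOf(cId)); // C_Id prefix
            for (int d = 0; d < dims.length; d++) {
                key.append('|').append(mask[d] ? dims[d] : "*");
            }
            out.add(new AbstractMap.SimpleEntry<>(key.toString(), measure));
        }
        return out;
    }

    /** Reduce side: one aggregation over all cuboids; keys with different
     *  C_Id prefixes never collide, so all cuboids coexist in one shuffle. */
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> mapped) {
        Map<String, Long> agg = new TreeMap<>();
        for (Map.Entry<String, Long> e : mapped) {
            agg.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return agg;
    }

    public static void main(String[] args) {
        // Two dimensions -> four cuboids: (A,B), (A), (B), ()
        List<boolean[]> masks = Arrays.asList(
                new boolean[]{true, true}, new boolean[]{true, false},
                new boolean[]{false, true}, new boolean[]{false, false});
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        mapped.addAll(map(new String[]{"us", "web"}, 3, masks));
        mapped.addAll(map(new String[]{"us", "app"}, 2, masks));
        Map<String, Long> result = reduce(mapped);
        System.out.println(result.get("1|us|*"));   // A-only cuboid: 3 + 2 = 5
        System.out.println(result.get("0|us|web")); // base cuboid row: 3
    }
}
```

Note the trade-off this sketch makes visible: map output volume grows by a factor of the number of cuboids, which is why I ask above whether the saved multi-stage read I/O outweighs the larger single shuffle.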