Hello Ravion,

Indeed, Kylin generates a MOLAP cube from data source tables (Hive tables, or other systems such as Kafka queues or JDBC sources like MySQL, Oracle...). In a Kylin project, data sources are defined in the "Data Sources" section. Then a "Data Model" has to be created, which indicates the relationships between the source tables (the joins of the star or snowflake schema), as well as the columns of each table that will be used as dimensions and those that will be used as measures. After this, the last metadata layer, the "Cube", is defined; it is closely related to the generation and storage of the MOLAP cube in HBase.

After the first build, the generated MOLAP cube is stored in HBase. Its size therefore depends on the definition of the "Cube", where the level of pre-aggregation of the stored data is determined by several concepts (e.g. Normal or Derived dimensions).

For example, I have 2 Kylin cubes built on the same Data Model, which sits on a DW in Hive. The fact table of this DW is 1 GB (ORC format with Snappy compression). One of the generated Kylin cubes is 1 GB, that is, almost the same size as the source DW in Hive (about 1 GB in Hive plus about 1 GB for the cube in HBase). However, the other Kylin cube, generated with a different cube definition over the same Data Model, is 10 GB. It is bigger because I defined more dimensions as Normal type in its cube definition, in order to achieve better query times.

I hope this clears up your doubts.

Best Regards,

Roberto Tardío Olmos
Head of Big Data Analytics
Avenida de Brasil, 17, Planta 16
28020 Madrid
Landline: 91.788.34.10
http://bigdata.stratebi.com/
http://www.stratebi.com <http://www.stratebi.com/>

From: ☼ R Nair [mailto:[email protected]]
Sent: Saturday, September 1, 2018 19:50
To: [email protected]
Subject: Data Duplication

Hi all,

I am new to Kylin.
So here is a fundamental question: when I create a cube, since it is MOLAP, I believe that, irrespective of the data already existing in HBase, Kylin will create a copy of the data in a cube/multidimensional format (separate from the underlying base data) to help slice and dice faster. Any idea of the size of the duplicate copy created?

Thanks

Best,
Ravion
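
A note on the cube sizes mentioned in the reply: the gap between the 1 GB and the 10 GB cube is easier to picture by counting cuboids, that is, the dimension combinations that get pre-aggregated. The sketch below is only an illustration, not Kylin code. It assumes a full cube with no pruning (no aggregation groups, mandatory, hierarchy or joint dimensions), in which case N Normal dimensions give up to 2^N cuboids. The dimension names are hypothetical, and "product_fk" stands for a foreign key that would represent the Derived dimensions of a lookup table.

```python
from itertools import combinations


def cuboid_count(normal_dims: int) -> int:
    """Upper bound on the number of cuboids for N Normal dimensions, no pruning."""
    return 2 ** normal_dims


def enumerate_cuboids(dims):
    """List every dimension combination, from the base cuboid down to the empty one."""
    return [
        combo
        for size in range(len(dims), -1, -1)
        for combo in combinations(dims, size)
    ]


# Hypothetical cube definitions over the same Data Model.
# Few Normal dimensions: lookup columns stay Derived behind "product_fk".
few = ["date", "product_fk"]
# More Normal dimensions: the lookup columns are promoted to Normal.
many = ["date", "product_fk", "category", "brand", "country"]

print(cuboid_count(len(few)))        # 4
print(cuboid_count(len(many)))       # 32
print(enumerate_cuboids(few))
```

The cuboid count is not a size in bytes: each cuboid holds only as many rows as there are distinct value combinations, so the real storage depends on the cardinality of each dimension, on the measures, and on HBase compression.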
