Here is some of my explanation and it may not be perfect. Segment in Kylin is part of model/cube pre-computed data, in most cases, divided by date column.
Here is some difference between Segment and Snapshot. Segment, whose source data comes from one fact table joins some dimension tables with 'specific date range', is 'precomputed', and will accelerate complex query. Snapshot, whose source data comes from one specific dimension table without specific date range, is "not precomputed", and can join with segments at runtime . - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management - https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro ------------------------ With warm regard Xiaoxiang Yu On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <na...@vnpay.vn> wrote: > Thank you again, very smart of you to automatically select cube for a > certain query. Sorry If I ask too much: Is the concept of Segment in Kylin > model similar to Slice-and-Dice concept of Cube, what is the different > between Kylin Segment and Kylin Snapshot? > > PS. I sent you the log files for your help in investigating why my cube > has not been used. > > On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <x...@apache.org> wrote: > >> I guess there is a misunderstanding from your sentences. >> >> -- 'I need to select Cube from a combo box below the query window' >> It is not right to use 'need', that combo box is for some specific >> cases(for example, Kylin did not choose a cube which is the most >> efficient), not the most cases. >> In most cases(both for Kylin 4 and Kylin 5), you don't need to select a >> Cube in the combo box, Kylin will do the choice for you. >> >> ------------------------ >> With warm regard >> Xiaoxiang Yu >> >> >> >> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote: >> >>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a >>> question of a beginner) >>> >>> "obviously" means "clearly" >>> >>> because I need to select Cube from a combo box below the query window >>> >>> Thank you very much >>> >>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <x...@apache.org> wrote: >>> >>>> From my side, I cannot understand why you say Kylin 4 is 'very >>>> obviously'. Can you give an example? >>>> From the source code, the basic logic of choosing the right cube/model >>>> are similar. >>>> ------------------------ >>>> With warm regard >>>> Xiaoxiang Yu >>>> >>>> >>>> >>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <na...@vnpay.vn> wrote: >>>> >>>>> Thank you for your kind reply, please answer 1 more question about >>>>> version 5: >>>>> >>>>> In version 4.x we run query against a Cube very obviously, but in >>>>> version 5, the cube usage is a implication socan you advise: for a given >>>>> query, which model will be used, which index (cube) will be used for this >>>>> query? >>>>> >>>>> Thank you >>>>> >>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <x...@apache.org> wrote: >>>>> >>>>>> 1. How do I measure the size of the index (cube) in version 5? >>>>>> You can check storage of specific Indexes from the Index page. >>>>>> >>>>>> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index >>>>>> or >>>>>> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png >>>>>> >>>>>> >>>>>> 2. How to create the cardinality for each column? >>>>>> You should check this link : >>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ . >>>>>> >>>>>> 3. In your default project sample named SSB project, you have only 4 >>>>>> simple aggregate group index and no table index as in attached file >>>>>> so what is the best strategy to select index for our OLAP? >>>>>> 1. There does exist a 'Base Table Index' by default actually, >>>>>> its id is 20000000001. >>>>>> 2. I think it is a good question and Kylin 5 lacks such a guide >>>>>> for better modeling. You are free to ask your question to >>>>>> mailing list and I will try to reply. >>>>>> >>>>>> ------------------------ >>>>>> With warm regard >>>>>> Xiaoxiang Yu >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <x...@apache.org> wrote: >>>>>> >>>>>>> OK, I didn't read all the mail history so I misunderstand the >>>>>>> situation. Looks like you need to analyse >>>>>>> the cause why the query didn't hit the cube correctly. >>>>>>> >>>>>>> Please generate query diagnosis package and send it to me privately. >>>>>>> I will analyse the query log. >>>>>>> You can refer to the following steps in screenshots. >>>>>>> >>>>>>> [image: image.png] >>>>>>> >>>>>>> If the screenshots are not displaying correctly, please read this >>>>>>> guide : >>>>>>> >>>>>>> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui >>>>>>> >>>>>>> By the way, you need to analyse the cause by reading >>>>>>> kylin.query.log, not the kylin.log, >>>>>>> refer to >>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log >>>>>>> >>>>>>> ------------------------ >>>>>>> With warm regard >>>>>>> Xiaoxiang Yu >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <na...@vnpay.vn> wrote: >>>>>>> >>>>>>>> Thank you Xiaoxiang for your advice. As my title email shown, I >>>>>>>> guessed that the OLAP functionalities has not been correctly set up in >>>>>>>> my >>>>>>>> computer. >>>>>>>> >>>>>>>> The evidence about it is that: when I disable the Pushdown option >>>>>>>> box to use solely the precomputation cube only, it showed following >>>>>>>> error: >>>>>>>> Please kindly advise how to properly build the OLAP >>>>>>>> >>>>>>>> LIMIT 500": No realization found for OLAPContext, >>>>>>>> MODEL_UNMATCHED_JOIN, >>>>>>>> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS, >>>>>>>> FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, >>>>>>>> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <x...@apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Yesterday, I tried to see if query pushdown functions work >>>>>>>>> well in the Kylin5 docker, and all of my queries return proper >>>>>>>>> responses . >>>>>>>>> After checking your logs from Shaofeng, I found these error >>>>>>>>> messages repeated many times: >>>>>>>>> 1. 'java.io.IOException: All datanodes DatanodeInfoWithStorage[ >>>>>>>>> 127.0.0.1:9866,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK] are >>>>>>>>> bad. Aborting...' >>>>>>>>> 2. 'curator.ConnectionState : Connection timed out for >>>>>>>>> connection string (localhost:2181) and timeout (15000) / elapsed >>>>>>>>> (41794) >>>>>>>>> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode >>>>>>>>> = ConnectionLoss' >>>>>>>>> >>>>>>>>> I guess the root cause is that the container didn't not have >>>>>>>>> enough resources. I found you query on a table called >>>>>>>>> 'XXX_hive_dwh_400million_rows', looks like you gave a complex query >>>>>>>>> on a >>>>>>>>> table which contains 400 million rows? >>>>>>>>> >>>>>>>>> Since I am the uploader of kylin5 's docker image, I want to >>>>>>>>> give some explainment. Kylin5 docker is not a place for performance >>>>>>>>> benchmarks, it is only for demonstration. It is only allocated with >>>>>>>>> very >>>>>>>>> little resources(8G memory) if you are using the default command from >>>>>>>>> docker hub page. Before I uploaded my image, I only tested my image >>>>>>>>> using >>>>>>>>> the ssb dataset, which the biggest table only contains about 60k >>>>>>>>> rows. If >>>>>>>>> you are using a larger dataset and complexer queries, you have to >>>>>>>>> scale the >>>>>>>>> resource properly. Try querying tables which contain not more than >>>>>>>>> 100k >>>>>>>>> rows by default. >>>>>>>>> >>>>>>>>> Here are some tips which may help you to check if the daemon >>>>>>>>> service is in health status and resources(particularly disk space) is >>>>>>>>> configured properly. >>>>>>>>> >>>>>>>>> 1. Checking HDFS 's web ui( >>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to confirm >>>>>>>>> whether HDFS service is in 'In service' status. >>>>>>>>> 2. Checking Datanode 's log in >>>>>>>>> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`, >>>>>>>>> check if >>>>>>>>> there is any error message. Like: cat >>>>>>>>> /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep >>>>>>>>> ERROR >>>>>>>>> | wc -l >>>>>>>>> 3. Checking if your docker engine is configured with enough >>>>>>>>> disk space, if you are using Docker Desktop like me,please go to >>>>>>>>> "Settings" >>>>>>>>> - "Resources" - "Advanced", make sure you have allocated 40GB+ disk >>>>>>>>> space >>>>>>>>> to the docker container. >>>>>>>>> 4. Checking the available disk space of your container by `df >>>>>>>>> -h`, make sure the 'Use%' of 'overlay' is less than 60% . >>>>>>>>> 5. Checking the load average/ cpu usage/ jvm gc. Make sure >>>>>>>>> these metrics are not really high when you send a query. >>>>>>>>> ------------------------ >>>>>>>>> With warm regard >>>>>>>>> Xiaoxiang Yu >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi ShaoFeng >>>>>>>>>> >>>>>>>>>> Thank you very much for your valuable feedback >>>>>>>>>> >>>>>>>>>> I saw the application to be there (if I see it right) as in the >>>>>>>>>> attachment photo. Kindly advise so that I can run this query on OLAP. >>>>>>>>>> >>>>>>>>>> PS. I sent you the log file in private. >>>>>>>>>> >>>>>>>>>> [image: image.png] >>>>>>>>>> >>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi < >>>>>>>>>> shaofeng...@apache.org> wrote: >>>>>>>>>> >>>>>>>>>>> Can you provide the messages in logs/kylin.log when executing >>>>>>>>>>> the SQL? and you can also check the Spark UI from yarn resource >>>>>>>>>>> manager >>>>>>>>>>> (there should be one running application called Spardar, which is >>>>>>>>>>> Kylin's >>>>>>>>>>> backend spark application). If the application is not there, it may >>>>>>>>>>> indicates the yarn doesn't have resource to startup it. >>>>>>>>>>> >>>>>>>>>>> Best regards, >>>>>>>>>>> >>>>>>>>>>> Shaofeng Shi 史少锋 >>>>>>>>>>> Apache Kylin PMC, >>>>>>>>>>> Apache Incubator PMC, >>>>>>>>>>> Email: shaofeng...@apache.org >>>>>>>>>>> >>>>>>>>>>> Apache Kylin FAQ: >>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html >>>>>>>>>>> Join Kylin user mail group: user-subscr...@kylin.apache.org >>>>>>>>>>> Join Kylin dev mail group: dev-subscr...@kylin.apache.org >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Nam Đỗ Duy <na...@vnpay.vn> 于2023年10月31日周二 10:35写道: >>>>>>>>>>> >>>>>>>>>>>> Dear Sir/Madam, >>>>>>>>>>>> >>>>>>>>>>>> I have a fact with 500million rows then I build model, index >>>>>>>>>>>> according to the website help. >>>>>>>>>>>> >>>>>>>>>>>> I chose full incremental because this is the first times I load >>>>>>>>>>>> data >>>>>>>>>>>> >>>>>>>>>>>> I create both index types Aggregate group index, table index as >>>>>>>>>>>> photo attached. >>>>>>>>>>>> >>>>>>>>>>>> But the query always failed after timeout of 300 seconds (I run >>>>>>>>>>>> in docker), I dont want to increase the value of 300 seconds >>>>>>>>>>>> because I wish >>>>>>>>>>>> the OLAP can run within 1 minutes (is that possible?) >>>>>>>>>>>> >>>>>>>>>>>> It seems that the OLAP function in indexing not working to >>>>>>>>>>>> speedup the query by precomputed cube. >>>>>>>>>>>> >>>>>>>>>>>> Can you advise to check whether the index did really work? >>>>>>>>>>>> >>>>>>>>>>>> It is quite urgent task for me so prompt response is highly >>>>>>>>>>>> appreciated. >>>>>>>>>>>> >>>>>>>>>>>> Thank you very much >>>>>>>>>>>> >>>>>>>>>>>