Thank you Xiaoxiang for the guideline. Will definitely read it carefully. Kindly help the following questions:
1. Computed column I created a “computed column” and add it to dimensions (among other dimensions) When I use query to select the computed column it returned error 2. Datatype optimization: will you think that the int be better than string for key join columns? Please advise On Wed, 1 Nov 2023 at 17:32 Xiaoxiang Yu <x...@apache.org> wrote: > Yes, that is almost correct. > > If you have a lot of complex queries, and you want to using Kylin 5 to > accelerate them, the recommended steps of mine are as follows: > > 1. You analyse all queries and collect all join relation/pattern. > 2. You create Models for each specific join relation/pattern, with the join > relation you find in above step. > 3. You analyse and collect dimensions and measures from all queries, and > add them to the corresponding Model. > 4. You build segments of all Models with proper data range. > 5. You turned off the pushdown switch, and sent all queries to Kylin. If > there are some queries which failed, fix them. > Here are some common situations. > 5.1 Join relation/pattern is not matched > 5.2 If the join relation is matched, the Model might not contain every > column that your query needs, please check kylin.query.log with keyword ' > unmatched'. > 6. (Optional) If you find some of your queries do not exactly match with > your Index(your query on [colA, colB], but your index contains more columns > than colA and colB), you can add some aggregate groups(or smaller Table > Index) to optimize the query performance. > > > > ------------------------ > With warm regard > Xiaoxiang Yu > > > > On Wed, Nov 1, 2023 at 5:57 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote: > > > Thank you Xiaoxiang, I nearly got to the point. > > > > So can I interpret that: 1 model equal (~) to a set of Joins of > (Dim/Fact) > > table, that is to say we need to create several models according to > > multiple kinds of joins queries? > > > > Best regards > > > > On Wed, Nov 1, 2023 at 4:50 PM Xiaoxiang Yu <x...@apache.org> wrote: > > > >> Have you ever tried to analyse the reason why your query can not hit > >> Model 'sample_ssb'? > >> It is because the join relation of your query is not suitable for the > >> join relation/pattern of Model 'sample_ssb'. > >> > >> Your query used a join relation/pattern like: A inner join B. > >> But the Model 'sample_ssb' used a join relation/pattern like : A inner > >> join B inner join C. > >> > >> If you are familiar with the definition of Inner join, you may know that > >> the > >> relation/pattern 'A inner join B inner join C' will have a chance > >> to lose some rows when compared to pattern 'A inner join B'. > >> So the Model 'sample_ssb' will be excluded to serve your query. > >> > >> That is to say, you need to create a new model that is similar to Model > >> 'sample_ssb', > >> but with additional tables removed. > >> > >> > >> > >> ------------------------ > >> With warm regard > >> Xiaoxiang Yu > >> > >> > >> > >> On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> > wrote: > >> > >>> Hi Xiaoxiang, > >>> > >>> Thank you very much > >>> > >>> I have clearer picture of Kylin already thanks to your explanation. > >>> > >>> Now back to the sample project of SSB in attached photo, when I run > this > >>> query with push_down option OFF, why the OLAP error appears, and in > such > >>> case, how to create a new cube for this query? > >>> > >>> [image: image.png] > >>> > >>> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <x...@apache.org> wrote: > >>> > >>>> Here is some of my explanation and it may not be perfect. > >>>> Segment in Kylin is part of model/cube pre-computed data, in most > >>>> cases, divided by date column. > >>>> > >>>> Here is some difference between Segment and Snapshot. > >>>> Segment, whose source data comes from one fact table joins some > dimension > >>>> tables with 'specific date range', is 'precomputed', and will > accelerate > >>>> complex query. > >>>> Snapshot, whose source data comes from one specific dimension table > without > >>>> specific date range, is "not precomputed", and can join with segments > >>>> at runtime . > >>>> > >>>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management > >>>> - > >>>> > https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro > >>>> > >>>> ------------------------ > >>>> With warm regard > >>>> Xiaoxiang Yu > >>>> > >>>> > >>>> > >>>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <na...@vnpay.vn> wrote: > >>>> > >>>>> Thank you again, very smart of you to automatically select cube for a > >>>>> certain query. Sorry If I ask too much: Is the concept of Segment in > Kylin > >>>>> model similar to Slice-and-Dice concept of Cube, what is the > different > >>>>> between Kylin Segment and Kylin Snapshot? > >>>>> > >>>>> PS. I sent you the log files for your help in investigating why my > >>>>> cube has not been used. > >>>>> > >>>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <x...@apache.org> wrote: > >>>>> > >>>>>> I guess there is a misunderstanding from your sentences. > >>>>>> > >>>>>> -- 'I need to select Cube from a combo box below the query window' > >>>>>> It is not right to use 'need', that combo box is for some specific > >>>>>> cases(for example, Kylin did not choose a cube which is the most > >>>>>> efficient), not the most cases. > >>>>>> In most cases(both for Kylin 4 and Kylin 5), you don't need to > select > >>>>>> a Cube in the combo box, Kylin will do the choice for you. > >>>>>> > >>>>>> ------------------------ > >>>>>> With warm regard > >>>>>> Xiaoxiang Yu > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a > >>>>>>> question of a beginner) > >>>>>>> > >>>>>>> "obviously" means "clearly" > >>>>>>> > >>>>>>> because I need to select Cube from a combo box below the query > window > >>>>>>> > >>>>>>> Thank you very much > >>>>>>> > >>>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <x...@apache.org> > wrote: > >>>>>>> > >>>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very > >>>>>>>> obviously'. Can you give an example? > >>>>>>>> From the source code, the basic logic of choosing the right > >>>>>>>> cube/model are similar. > >>>>>>>> ------------------------ > >>>>>>>> With warm regard > >>>>>>>> Xiaoxiang Yu > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <na...@vnpay.vn> wrote: > >>>>>>>> > >>>>>>>>> Thank you for your kind reply, please answer 1 more question > about > >>>>>>>>> version 5: > >>>>>>>>> > >>>>>>>>> In version 4.x we run query against a Cube very obviously, but in > >>>>>>>>> version 5, the cube usage is a implication socan you advise: for > a given > >>>>>>>>> query, which model will be used, which index (cube) will be used > for this > >>>>>>>>> query? > >>>>>>>>> > >>>>>>>>> Thank you > >>>>>>>>> > >>>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <x...@apache.org> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> 1. How do I measure the size of the index (cube) in version 5? > >>>>>>>>>> You can check storage of specific Indexes from the Index > page. > >>>>>>>>>> > >>>>>>>>>> > https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index > >>>>>>>>>> or > >>>>>>>>>> > https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> 2. How to create the cardinality for each column? > >>>>>>>>>> You should check this link : > >>>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ . > >>>>>>>>>> > >>>>>>>>>> 3. In your default project sample named SSB project, you have > >>>>>>>>>> only 4 simple aggregate group index and no table index as in > attached file > >>>>>>>>>> so what is the best strategy to select index for our OLAP? > >>>>>>>>>> 1. There does exist a 'Base Table Index' by default > >>>>>>>>>> actually, its id is 20000000001. > >>>>>>>>>> 2. I think it is a good question and Kylin 5 lacks such a > >>>>>>>>>> guide for better modeling. You are free to ask your question to > >>>>>>>>>> mailing list and I will try to reply. > >>>>>>>>>> > >>>>>>>>>> ------------------------ > >>>>>>>>>> With warm regard > >>>>>>>>>> Xiaoxiang Yu > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <x...@apache.org> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> OK, I didn't read all the mail history so I misunderstand the > >>>>>>>>>>> situation. Looks like you need to analyse > >>>>>>>>>>> the cause why the query didn't hit the cube correctly. > >>>>>>>>>>> > >>>>>>>>>>> Please generate query diagnosis package and send it to me > >>>>>>>>>>> privately. I will analyse the query log. > >>>>>>>>>>> You can refer to the following steps in screenshots. > >>>>>>>>>>> > >>>>>>>>>>> [image: image.png] > >>>>>>>>>>> > >>>>>>>>>>> If the screenshots are not displaying correctly, please read > >>>>>>>>>>> this guide : > >>>>>>>>>>> > >>>>>>>>>>> > https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui > >>>>>>>>>>> > >>>>>>>>>>> By the way, you need to analyse the cause by reading > >>>>>>>>>>> kylin.query.log, not the kylin.log, > >>>>>>>>>>> refer to > >>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log > >>>>>>>>>>> > >>>>>>>>>>> ------------------------ > >>>>>>>>>>> With warm regard > >>>>>>>>>>> Xiaoxiang Yu > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <na...@vnpay.vn> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Thank you Xiaoxiang for your advice. As my title email shown, > I > >>>>>>>>>>>> guessed that the OLAP functionalities has not been correctly > set up in my > >>>>>>>>>>>> computer. > >>>>>>>>>>>> > >>>>>>>>>>>> The evidence about it is that: when I disable the Pushdown > >>>>>>>>>>>> option box to use solely the precomputation cube only, it > showed following > >>>>>>>>>>>> error: Please kindly advise how to properly build the OLAP > >>>>>>>>>>>> > >>>>>>>>>>>> LIMIT 500": No realization found for OLAPContext, > MODEL_UNMATCHED_JOIN, > rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS, > FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, > 12, 13, 14, 15, 16, 17, 18, 19, 20]) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <x...@apache.org > > > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Yesterday, I tried to see if query pushdown functions > work > >>>>>>>>>>>>> well in the Kylin5 docker, and all of my queries return > proper responses . > >>>>>>>>>>>>> After checking your logs from Shaofeng, I found these > >>>>>>>>>>>>> error messages repeated many times: > >>>>>>>>>>>>> 1. 'java.io.IOException: All datanodes > >>>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866 > ,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK] > >>>>>>>>>>>>> are bad. Aborting...' > >>>>>>>>>>>>> 2. 'curator.ConnectionState : Connection timed out for > >>>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) / > elapsed (41794) > >>>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException: > >>>>>>>>>>>>> KeeperErrorCode = ConnectionLoss' > >>>>>>>>>>>>> > >>>>>>>>>>>>> I guess the root cause is that the container didn't not > >>>>>>>>>>>>> have enough resources. I found you query on a table called > >>>>>>>>>>>>> 'XXX_hive_dwh_400million_rows', looks like you gave a > complex query on a > >>>>>>>>>>>>> table which contains 400 million rows? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Since I am the uploader of kylin5 's docker image, I want > >>>>>>>>>>>>> to give some explainment. Kylin5 docker is not a place for > performance > >>>>>>>>>>>>> benchmarks, it is only for demonstration. It is only > allocated with very > >>>>>>>>>>>>> little resources(8G memory) if you are using the default > command from > >>>>>>>>>>>>> docker hub page. Before I uploaded my image, I only tested > my image using > >>>>>>>>>>>>> the ssb dataset, which the biggest table only contains about > 60k rows. If > >>>>>>>>>>>>> you are using a larger dataset and complexer queries, you > have to scale the > >>>>>>>>>>>>> resource properly. Try querying tables which contain not > more than 100k > >>>>>>>>>>>>> rows by default. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Here are some tips which may help you to check if the > >>>>>>>>>>>>> daemon service is in health status and > resources(particularly disk space) > >>>>>>>>>>>>> is configured properly. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1. Checking HDFS 's web ui( > >>>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to > >>>>>>>>>>>>> confirm whether HDFS service is in 'In service' status. > >>>>>>>>>>>>> 2. Checking Datanode 's log in > >>>>>>>>>>>>> > `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`, check if > >>>>>>>>>>>>> there is any error message. Like: cat > >>>>>>>>>>>>> > /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep ERROR > >>>>>>>>>>>>> | wc -l > >>>>>>>>>>>>> 3. Checking if your docker engine is configured with > >>>>>>>>>>>>> enough disk space, if you are using Docker Desktop like > me,please go to > >>>>>>>>>>>>> "Settings" - "Resources" - "Advanced", make sure you have > allocated 40GB+ > >>>>>>>>>>>>> disk space to the docker container. > >>>>>>>>>>>>> 4. Checking the available disk space of your container by > >>>>>>>>>>>>> `df -h`, make sure the 'Use%' of 'overlay' is less than 60% . > >>>>>>>>>>>>> 5. Checking the load average/ cpu usage/ jvm gc. Make > sure > >>>>>>>>>>>>> these metrics are not really high when you send a query. > >>>>>>>>>>>>> ------------------------ > >>>>>>>>>>>>> With warm regard > >>>>>>>>>>>>> Xiaoxiang Yu > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy > >>>>>>>>>>>>> <na...@vnpay.vn.invalid> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi ShaoFeng > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thank you very much for your valuable feedback > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I saw the application to be there (if I see it right) as in > >>>>>>>>>>>>>> the attachment photo. Kindly advise so that I can run this > query on OLAP. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> PS. I sent you the log file in private. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [image: image.png] > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi < > >>>>>>>>>>>>>> shaofeng...@apache.org> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when > >>>>>>>>>>>>>>> executing the SQL? and you can also check the Spark UI > from yarn resource > >>>>>>>>>>>>>>> manager (there should be one running application called > Spardar, which is > >>>>>>>>>>>>>>> Kylin's backend spark application). If the application is > not there, it may > >>>>>>>>>>>>>>> indicates the yarn doesn't have resource to startup it. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Best regards, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Shaofeng Shi 史少锋 > >>>>>>>>>>>>>>> Apache Kylin PMC, > >>>>>>>>>>>>>>> Apache Incubator PMC, > >>>>>>>>>>>>>>> Email: shaofeng...@apache.org > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Apache Kylin FAQ: > >>>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html > >>>>>>>>>>>>>>> Join Kylin user mail group: > user-subscr...@kylin.apache.org > >>>>>>>>>>>>>>> Join Kylin dev mail group: dev-subscr...@kylin.apache.org > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Nam Đỗ Duy <na...@vnpay.vn> 于2023年10月31日周二 10:35写道: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Dear Sir/Madam, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I have a fact with 500million rows then I build model, > >>>>>>>>>>>>>>>> index according to the website help. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I chose full incremental because this is the first times I > >>>>>>>>>>>>>>>> load data > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I create both index types Aggregate group index, table > >>>>>>>>>>>>>>>> index as photo attached. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> But the query always failed after timeout of 300 seconds > (I > >>>>>>>>>>>>>>>> run in docker), I dont want to increase the value of 300 > seconds because I > >>>>>>>>>>>>>>>> wish the OLAP can run within 1 minutes (is that possible?) > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> It seems that the OLAP function in indexing not working to > >>>>>>>>>>>>>>>> speedup the query by precomputed cube. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Can you advise to check whether the index did really work? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> It is quite urgent task for me so prompt response is > highly > >>>>>>>>>>>>>>>> appreciated. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thank you very much > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> >