Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Xiaoxiang Yu Wed, 01 Nov 2023 03:32:29 -0700

Yes, that is almost correct.

If you have a lot of complex queries and you want to use Kylin 5 to
accelerate them, these are the steps I recommend:


1. Analyse all queries and collect every join relation/pattern.
2. Create one Model for each specific join relation/pattern found in the
step above.
3. Analyse the queries again, collect the dimensions and measures they use,
and add them to the corresponding Model.
4. Build segments for all Models with a proper data range.
5. Turn off the pushdown switch and send all queries to Kylin. If some
queries fail, fix them.
    Here are some common situations:
    5.1 The join relation/pattern is not matched.
    5.2 The join relation is matched, but the Model does not contain every
column that your query needs. Check kylin.query.log for the keyword
'unmatched'.
6. (Optional) If some of your queries do not exactly match your Index (for
example, your query is on [colA, colB] but the index contains more columns
than colA and colB), you can add aggregate groups (or a smaller Table
Index) to optimize query performance.
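
To illustrate situation 5.1, here is a minimal Python sketch of why a model
built on 'A inner join B inner join C' cannot safely answer a query on
'A inner join B'. It does not use any Kylin API. The table names A, B and C
follow the thread, but the rows, the column names and the inner_join helper
are made up for illustration only.

```python
# Toy tables (hypothetical data): A is the fact table, B and C are dimensions.
A = [
    {"id": 1, "b_id": 10, "c_id": 100},
    {"id": 2, "b_id": 20, "c_id": None},  # this fact row has no match in C
    {"id": 3, "b_id": 30, "c_id": 300},
]
B = [{"b_id": 10}, {"b_id": 20}, {"b_id": 30}]
C = [{"c_id": 100}, {"c_id": 300}]


def inner_join(left, right, key):
    """Nested-loop inner join on one key column (illustrative helper)."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[key] is not None and l[key] == r[key]
    ]


# Rows a query on 'A inner join B' expects to see.
a_b = inner_join(A, B, "b_id")

# Rows that data precomputed for 'A inner join B inner join C' would contain.
a_b_c = inner_join(a_b, C, "c_id")

ids_query = sorted(row["id"] for row in a_b)
ids_model = sorted(row["id"] for row in a_b_c)

print("query  A-B   ids:", ids_query)
print("model  A-B-C ids:", ids_model)
print("rows the model would lose:", sorted(set(ids_query) - set(ids_model)))
```

The fact row without a match in C survives the two-table join but is dropped
by the three-table join, so answering the query from the wider model could
return fewer rows than expected.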



------------------------
With warm regard
Xiaoxiang Yu



On Wed, Nov 1, 2023 at 5:57 PM Nam Đỗ Duy <[email protected]> wrote:

> Thank you Xiaoxiang, I nearly got to the point.
>
> So can I interpret it this way: one model is equal (~) to one set of joins
> between (Dim/Fact) tables, that is to say, we need to create several models
> to cover the different kinds of join queries?
>
> Best regards
>
> On Wed, Nov 1, 2023 at 4:50 PM Xiaoxiang Yu <[email protected]> wrote:
>
>> Have you tried to analyse why your query cannot hit Model 'sample_ssb'?
>> It is because the join relation of your query does not match the
>> join relation/pattern of Model 'sample_ssb'.
>>
>> Your query used a join relation/pattern like: A inner join B.
>> But the Model 'sample_ssb' used a join relation/pattern like : A inner
>> join B inner join C.
>>
>> If you are familiar with the definition of inner join, you may know that
>> the relation/pattern 'A inner join B inner join C' can lose some rows
>> compared to the pattern 'A inner join B' (any row without a match in C
>> is dropped).
>> So the Model 'sample_ssb' is excluded from serving your query.
>>
>> That is to say, you need to create a new model that is similar to Model
>> 'sample_ssb', but with the additional tables removed.
>>
>>
>>
>> ------------------------
>> With warm regard
>> Xiaoxiang Yu
>>
>>
>>
>> On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <[email protected]> wrote:
>>
>>> Hi Xiaoxiang,
>>>
>>> Thank you very much
>>>
>>> I have a clearer picture of Kylin already, thanks to your explanation.
>>>
>>> Now back to the SSB sample project in the attached photo: when I run this
>>> query with the push_down option OFF, why does the OLAP error appear, and in
>>> that case, how do I create a new cube for this query?
>>>
>>> [image: image.png]
>>>
>>> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <[email protected]> wrote:
>>>
>>>> Here is my explanation; it may not be perfect.
>>>> A Segment in Kylin is a part of the pre-computed data of a model/cube,
>>>> in most cases divided by a date column.
>>>>
>>>> Here are some differences between a Segment and a Snapshot.
>>>> A Segment's source data comes from one fact table joined with some
>>>> dimension tables over a specific date range; it is precomputed and
>>>> accelerates complex queries.
>>>> A Snapshot's source data comes from one specific dimension table without
>>>> a specific date range; it is not precomputed and can be joined with
>>>> segments at runtime.
>>>>
>>>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management
>>>> -
>>>> https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro
>>>>
>>>> ------------------------
>>>> With warm regard
>>>> Xiaoxiang Yu
>>>>
>>>>
>>>>
>>>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <[email protected]> wrote:
>>>>
>>>>> Thank you again, it is very smart of you to select the cube automatically
>>>>> for a given query. Sorry if I ask too much: is the concept of a Segment in
>>>>> a Kylin model similar to the slice-and-dice concept of a cube, and what is
>>>>> the difference between a Kylin Segment and a Kylin Snapshot?
>>>>>
>>>>> PS. I sent you the log files for your help in investigating why my
>>>>> cube has not been used.
>>>>>
>>>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <[email protected]> wrote:
>>>>>
>>>>>> I guess there is a misunderstanding in your sentence.
>>>>>>
>>>>>> -- 'I need to select Cube from a combo box below the query window'
>>>>>> It is not right to say 'need'; that combo box is for some specific
>>>>>> cases (for example, when Kylin did not choose the most efficient
>>>>>> cube), not for most cases.
>>>>>> In most cases (both for Kylin 4 and Kylin 5), you don't need to select
>>>>>> a Cube in the combo box; Kylin will make the choice for you.
>>>>>>
>>>>>> ------------------------
>>>>>> With warm regard
>>>>>> Xiaoxiang Yu
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a
>>>>>>> question of a beginner)
>>>>>>>
>>>>>>> "obviously" means "clearly"
>>>>>>>
>>>>>>> because I need to select Cube from a combo box below the query window
>>>>>>>
>>>>>>> Thank you very much
>>>>>>>
>>>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <[email protected]> wrote:
>>>>>>>
>>>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very
>>>>>>>> obviously'. Can you give an example?
>>>>>>>> From the source code, the basic logic of choosing the right
>>>>>>>> cube/model is similar.
>>>>>>>> ------------------------
>>>>>>>> With warm regard
>>>>>>>> Xiaoxiang Yu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thank you for your kind reply, please answer 1 more question about
>>>>>>>>> version 5:
>>>>>>>>>
>>>>>>>>> In version 4.x we run a query against a Cube very obviously, but in
>>>>>>>>> version 5 the cube usage is an implication, so can you advise: for a
>>>>>>>>> given query, which model will be used, and which index (cube) will be
>>>>>>>>> used for this query?
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>>
>>>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> 1. How do I measure the size of the index (cube) in version 5?
>>>>>>>>>>    You can check storage of specific Indexes from the Index page.
>>>>>>>>>>
>>>>>>>>>> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index
>>>>>>>>>> or
>>>>>>>>>> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2. How to create the cardinality for each column?
>>>>>>>>>>    You should check this link :
>>>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ .
>>>>>>>>>>
>>>>>>>>>> 3. In your default project sample named SSB project, you have
>>>>>>>>>> only 4 simple aggregate group indexes and no table index, as in the
>>>>>>>>>> attached file, so what is the best strategy to select indexes for
>>>>>>>>>> our OLAP?
>>>>>>>>>>     1. A 'Base Table Index' actually does exist by default;
>>>>>>>>>> its id is 20000000001.
>>>>>>>>>>     2. I think it is a good question, and Kylin 5 lacks such a
>>>>>>>>>> guide for better modeling. You are free to ask your questions on the
>>>>>>>>>> mailing list and I will try to reply.
>>>>>>>>>>
>>>>>>>>>> ------------------------
>>>>>>>>>> With warm regard
>>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> OK, I didn't read all the mail history, so I misunderstood the
>>>>>>>>>>> situation. It looks like you need to analyse
>>>>>>>>>>> why the query didn't hit the cube correctly.
>>>>>>>>>>>
>>>>>>>>>>> Please generate a query diagnosis package and send it to me
>>>>>>>>>>> privately. I will analyse the query log.
>>>>>>>>>>> You can refer to the steps in the following screenshots.
>>>>>>>>>>>
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>> If the screenshots are not displaying correctly, please read
>>>>>>>>>>> this guide :
>>>>>>>>>>>
>>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui
>>>>>>>>>>>
>>>>>>>>>>> By the way, you need to analyse the cause by reading
>>>>>>>>>>> kylin.query.log, not the kylin.log,
>>>>>>>>>>> refer to
>>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log
>>>>>>>>>>>
>>>>>>>>>>> ------------------------
>>>>>>>>>>> With warm regard
>>>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you Xiaoxiang for your advice. As the title of my email
>>>>>>>>>>>> shows, I guessed that the OLAP functionality has not been set up
>>>>>>>>>>>> correctly on my computer.
>>>>>>>>>>>>
>>>>>>>>>>>> The evidence is this: when I disable the Pushdown
>>>>>>>>>>>> option box to use only the precomputed cube, it shows the
>>>>>>>>>>>> following error. Please kindly advise how to properly build the OLAP.
>>>>>>>>>>>>
>>>>>>>>>>>> LIMIT 500": No realization found for OLAPContext, 
>>>>>>>>>>>> MODEL_UNMATCHED_JOIN, 
>>>>>>>>>>>> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS,
>>>>>>>>>>>>  FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
>>>>>>>>>>>> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Yesterday, I tried to see if the query pushdown function works
>>>>>>>>>>>>> well in the Kylin5 docker, and all of my queries returned proper
>>>>>>>>>>>>> responses.
>>>>>>>>>>>>>     After checking your logs from Shaofeng, I found these
>>>>>>>>>>>>> error messages repeated many times:
>>>>>>>>>>>>>     1. 'java.io.IOException: All datanodes
>>>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK]
>>>>>>>>>>>>> are bad. Aborting...'
>>>>>>>>>>>>>     2. 'curator.ConnectionState : Connection timed out for
>>>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) / elapsed 
>>>>>>>>>>>>> (41794)
>>>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException:
>>>>>>>>>>>>> KeeperErrorCode = ConnectionLoss'
>>>>>>>>>>>>>
>>>>>>>>>>>>>     I guess the root cause is that the container did not
>>>>>>>>>>>>> have enough resources. I found that you queried a table called
>>>>>>>>>>>>> 'XXX_hive_dwh_400million_rows'; it looks like you ran a complex
>>>>>>>>>>>>> query on a table which contains 400 million rows?
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Since I am the uploader of Kylin5's docker image, I want
>>>>>>>>>>>>> to give some explanation. The Kylin5 docker is not a place for
>>>>>>>>>>>>> performance benchmarks; it is only for demonstration. It is
>>>>>>>>>>>>> allocated very few resources (8G memory) if you are using the
>>>>>>>>>>>>> default command from the docker hub page. Before I uploaded my
>>>>>>>>>>>>> image, I only tested it using the ssb dataset, whose biggest table
>>>>>>>>>>>>> contains only about 60k rows. If you are using a larger dataset
>>>>>>>>>>>>> and more complex queries, you have to scale the resources
>>>>>>>>>>>>> properly. By default, try querying tables which contain no more
>>>>>>>>>>>>> than 100k rows.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Here are some tips which may help you check whether the
>>>>>>>>>>>>> daemon services are healthy and resources (particularly
>>>>>>>>>>>>> disk space) are configured properly.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     1. Check the HDFS web UI (
>>>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to
>>>>>>>>>>>>> confirm that the HDFS service is in 'In service' status.
>>>>>>>>>>>>>     2. Check the Datanode log in
>>>>>>>>>>>>> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`
>>>>>>>>>>>>> for any error messages, for example:
>>>>>>>>>>>>> cat /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep ERROR | wc -l
>>>>>>>>>>>>>     3. Check that your docker engine is configured with
>>>>>>>>>>>>> enough disk space. If you are using Docker Desktop like me, please
>>>>>>>>>>>>> go to "Settings" - "Resources" - "Advanced" and make sure you have
>>>>>>>>>>>>> allocated 40GB+ disk space to the docker container.
>>>>>>>>>>>>>     4. Check the available disk space of your container with
>>>>>>>>>>>>> `df -h`, and make sure the 'Use%' of 'overlay' is less than 60%.
>>>>>>>>>>>>>     5. Check the load average, CPU usage and JVM GC. Make sure
>>>>>>>>>>>>> these metrics are not too high when you send a query.
>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>> With warm regard
>>>>>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi ShaoFeng
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you very much for your valuable feedback
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I saw the application there (if I see it right), as in
>>>>>>>>>>>>>> the attached photo. Kindly advise so that I can run this query
>>>>>>>>>>>>>> on OLAP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PS. I sent you the log file in private.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when
>>>>>>>>>>>>>>> executing the SQL? You can also check the Spark UI from the
>>>>>>>>>>>>>>> YARN resource manager (there should be one running application
>>>>>>>>>>>>>>> called Sparder, which is Kylin's backend Spark application). If
>>>>>>>>>>>>>>> the application is not there, it may indicate that YARN doesn't
>>>>>>>>>>>>>>> have the resources to start it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>>>>>>> Apache Kylin PMC,
>>>>>>>>>>>>>>> Apache Incubator PMC,
>>>>>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Apache Kylin FAQ:
>>>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>>>>>>>>>>> Join Kylin user mail group: [email protected]
>>>>>>>>>>>>>>> Join Kylin dev mail group: [email protected]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 10:35 AM Nam Đỗ Duy <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Sir/Madam,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a fact table with 500 million rows, and I built a model
>>>>>>>>>>>>>>>> and indexes according to the website help.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I chose full incremental because this is the first time I
>>>>>>>>>>>>>>>> load data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I created both index types, aggregate group index and table
>>>>>>>>>>>>>>>> index, as in the attached photo.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But the query always fails after the timeout of 300 seconds (I
>>>>>>>>>>>>>>>> run in docker). I don't want to increase the value of 300
>>>>>>>>>>>>>>>> seconds because I wish the OLAP query could run within 1 minute
>>>>>>>>>>>>>>>> (is that possible?)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It seems that the OLAP indexing function is not working to
>>>>>>>>>>>>>>>> speed up the query with the precomputed cube.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you advise how to check whether the index really worked?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is quite an urgent task for me, so a prompt response is highly
>>>>>>>>>>>>>>>> appreciated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you very much
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
