Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Xiaoxiang Yu Wed, 01 Nov 2023 20:00:45 -0700

Congratulations, hope you will make good use of the ability of Kylin 5 for
your use cases.



------------------------
With warm regard
Xiaoxiang Yu



On Thu, Nov 2, 2023 at 10:50 AM Nam Đỗ Duy <[email protected]> wrote:

> The query is too fast, less than a second, can you make it a little bit
> slower so that I can see it clearly 😀😀
> [image: image.png]
>
> On Thu, Nov 2, 2023 at 9:32 AM Nam Đỗ Duy <[email protected]> wrote:
>
>> Thank you Xiaoxiang for the guideline. Will definitely read it carefully.
>> Kindly help the following questions:
>>
>> 1. Computed column
>>
>> I created a “computed column” and add it to dimensions (among other
>> dimensions)
>>
>> When I use query to select the computed column it returned error
>>
>> 2. Datatype optimization: will you think that the int be better than
>> string for key join columns?
>>
>> Please advise
>>
>>
>> On Wed, 1 Nov 2023 at 17:32 Xiaoxiang Yu <[email protected]> wrote:
>>
>>> Yes, that is almost correct.
>>>
>>> If you have a lot of complex queries, and you want to using Kylin 5 to
>>> accelerate them, the recommended steps of mine are as follows:
>>>
>>> 1. You analyse all queries and collect all join relation/pattern.
>>> 2. You create Models for each specific join relation/pattern, with the
>>> join
>>> relation you find in above step.
>>> 3. You analyse and collect dimensions and measures from all queries, and
>>> add them to the corresponding Model.
>>> 4. You build segments of all Models with proper data range.
>>> 5. You turned off the pushdown switch, and sent all queries to Kylin. If
>>> there are some queries which failed, fix them.
>>>     Here are some common situations.
>>>     5.1 Join relation/pattern is not matched
>>>     5.2 If the join relation is matched, the Model might not contain
>>> every
>>> column that your query needs, please check kylin.query.log with keyword '
>>> unmatched'.
>>> 6. (Optional) If you find some of your queries do not exactly match with
>>> your Index(your query on [colA, colB], but your index contains more
>>> columns
>>> than colA and colB), you can add some aggregate groups(or smaller Table
>>> Index) to optimize the query performance.
>>>
>>>
>>>
>>> ------------------------
>>> With warm regard
>>> Xiaoxiang Yu
>>>
>>>
>>>
>>> On Wed, Nov 1, 2023 at 5:57 PM Nam Đỗ Duy <[email protected]>
>>> wrote:
>>>
>>> > Thank you Xiaoxiang, I nearly got to the point.
>>> >
>>> > So can I interpret that: 1 model equal (~) to a set of Joins of
>>> (Dim/Fact)
>>> > table, that is to say we need to create several models according to
>>> > multiple kinds of joins queries?
>>> >
>>> > Best regards
>>> >
>>> > On Wed, Nov 1, 2023 at 4:50 PM Xiaoxiang Yu <[email protected]> wrote:
>>> >
>>> >> Have you ever tried to analyse the reason why your query can not hit
>>> >> Model 'sample_ssb'?
>>> >> It is because the join relation of your query is not suitable for the
>>> >> join relation/pattern of  Model 'sample_ssb'.
>>> >>
>>> >> Your query used a join relation/pattern like: A inner join B.
>>> >> But the Model 'sample_ssb' used a join relation/pattern like : A inner
>>> >> join B inner join C.
>>> >>
>>> >> If you are familiar with the definition of Inner join, you may know
>>> that
>>> >> the
>>> >> relation/pattern 'A inner join B inner join C' will have a chance
>>> >> to lose some rows when compared to pattern 'A inner join B'.
>>> >> So the Model 'sample_ssb' will be excluded to serve your query.
>>> >>
>>> >> That is to say, you need to create a new model that is similar to
>>> Model
>>> >> 'sample_ssb',
>>> >>  but with additional tables removed.
>>> >>
>>> >>
>>> >>
>>> >> ------------------------
>>> >> With warm regard
>>> >> Xiaoxiang Yu
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <[email protected]>
>>> wrote:
>>> >>
>>> >>> Hi Xiaoxiang,
>>> >>>
>>> >>> Thank you very much
>>> >>>
>>> >>> I have clearer picture of Kylin already thanks to your explanation.
>>> >>>
>>> >>> Now back to the sample project of SSB in attached photo, when I run
>>> this
>>> >>> query with push_down option OFF, why the OLAP error appears, and in
>>> such
>>> >>> case, how to create a new cube for this query?
>>> >>>
>>> >>> [image: image.png]
>>> >>>
>>> >>> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <[email protected]> wrote:
>>> >>>
>>> >>>> Here is some of my explanation and it may not be perfect.
>>> >>>> Segment in Kylin is part of model/cube pre-computed data, in most
>>> >>>> cases, divided by date column.
>>> >>>>
>>> >>>> Here is some difference between Segment and Snapshot.
>>> >>>> Segment, whose source data comes from one fact table joins some
>>> dimension
>>> >>>> tables with 'specific date range', is 'precomputed', and will
>>> accelerate
>>> >>>> complex query.
>>> >>>> Snapshot, whose source data comes from one specific dimension table
>>> without
>>> >>>> specific date range, is "not precomputed", and can join with
>>> segments
>>> >>>> at runtime .
>>> >>>>
>>> >>>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management
>>> >>>> -
>>> >>>>
>>> https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro
>>> >>>>
>>> >>>> ------------------------
>>> >>>> With warm regard
>>> >>>> Xiaoxiang Yu
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <[email protected]> wrote:
>>> >>>>
>>> >>>>> Thank you again, very smart of you to automatically select cube
>>> for a
>>> >>>>> certain query. Sorry If I ask too much: Is the concept of Segment
>>> in Kylin
>>> >>>>> model similar to Slice-and-Dice concept of Cube, what is the
>>> different
>>> >>>>> between Kylin Segment and Kylin Snapshot?
>>> >>>>>
>>> >>>>> PS. I sent you the log files for your help in investigating why my
>>> >>>>> cube has not been used.
>>> >>>>>
>>> >>>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <[email protected]>
>>> wrote:
>>> >>>>>
>>> >>>>>> I guess there is a misunderstanding from your sentences.
>>> >>>>>>
>>> >>>>>> -- 'I need to select Cube from a combo box below the query window'
>>> >>>>>> It is not right to use 'need', that combo box is for some specific
>>> >>>>>> cases(for example, Kylin did not choose a cube which is the most
>>> >>>>>> efficient), not the most cases.
>>> >>>>>> In most cases(both for Kylin 4 and Kylin 5), you don't need to
>>> select
>>> >>>>>> a Cube in the combo box, Kylin will do the choice for you.
>>> >>>>>>
>>> >>>>>> ------------------------
>>> >>>>>> With warm regard
>>> >>>>>> Xiaoxiang Yu
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <[email protected]
>>> >
>>> >>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a
>>> >>>>>>> question of a beginner)
>>> >>>>>>>
>>> >>>>>>> "obviously" means "clearly"
>>> >>>>>>>
>>> >>>>>>> because I need to select Cube from a combo box below the query
>>> window
>>> >>>>>>>
>>> >>>>>>> Thank you very much
>>> >>>>>>>
>>> >>>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <[email protected]>
>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very
>>> >>>>>>>> obviously'. Can you give an example?
>>> >>>>>>>> From the source code, the basic logic of choosing the right
>>> >>>>>>>> cube/model are similar.
>>> >>>>>>>> ------------------------
>>> >>>>>>>> With warm regard
>>> >>>>>>>> Xiaoxiang Yu
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <[email protected]>
>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Thank you for your kind reply, please answer 1 more question
>>> about
>>> >>>>>>>>> version 5:
>>> >>>>>>>>>
>>> >>>>>>>>> In version 4.x we run query against a Cube very obviously, but
>>> in
>>> >>>>>>>>> version 5, the cube usage is a implication socan you advise:
>>> for a given
>>> >>>>>>>>> query, which model will be used, which index (cube) will be
>>> used for this
>>> >>>>>>>>> query?
>>> >>>>>>>>>
>>> >>>>>>>>> Thank you
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <[email protected]>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> 1. How do I measure the size of the index (cube) in version 5?
>>> >>>>>>>>>>    You can check storage of specific Indexes from the Index
>>> page.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index
>>> >>>>>>>>>> or
>>> >>>>>>>>>>
>>> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> 2. How to create the cardinality for each column?
>>> >>>>>>>>>>    You should check this link :
>>> >>>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ .
>>> >>>>>>>>>>
>>> >>>>>>>>>> 3. In your default project sample named SSB project, you have
>>> >>>>>>>>>> only 4 simple aggregate group index and no table index as in
>>> attached file
>>> >>>>>>>>>> so what is the best strategy to select index for our OLAP?
>>> >>>>>>>>>>     1. There does exist a 'Base Table Index'  by default
>>> >>>>>>>>>> actually, its id is 20000000001.
>>> >>>>>>>>>>     2. I think it is a good question and Kylin 5 lacks such a
>>> >>>>>>>>>> guide for better modeling. You are free to ask your question
>>> to
>>> >>>>>>>>>> mailing list and I will try to reply.
>>> >>>>>>>>>>
>>> >>>>>>>>>> ------------------------
>>> >>>>>>>>>> With warm regard
>>> >>>>>>>>>> Xiaoxiang Yu
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <[email protected]>
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>> OK, I didn't read all the mail history so I misunderstand the
>>> >>>>>>>>>>> situation. Looks like you need to analyse
>>> >>>>>>>>>>> the cause why the query didn't hit the cube correctly.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Please generate query diagnosis package and send it to me
>>> >>>>>>>>>>> privately. I will analyse the query log.
>>> >>>>>>>>>>> You can refer to the following steps in screenshots.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> [image: image.png]
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> If the screenshots are not displaying correctly, please read
>>> >>>>>>>>>>> this guide :
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> By the way, you need to analyse the cause by reading
>>> >>>>>>>>>>> kylin.query.log, not the kylin.log,
>>> >>>>>>>>>>> refer to
>>> >>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> ------------------------
>>> >>>>>>>>>>> With warm regard
>>> >>>>>>>>>>> Xiaoxiang Yu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <[email protected]>
>>> >>>>>>>>>>> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> Thank you Xiaoxiang for your advice. As my title email
>>> shown, I
>>> >>>>>>>>>>>> guessed that the OLAP functionalities has not been
>>> correctly set up in my
>>> >>>>>>>>>>>> computer.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> The evidence about it is that: when I disable the Pushdown
>>> >>>>>>>>>>>> option box to use solely the precomputation cube only, it
>>> showed following
>>> >>>>>>>>>>>> error: Please kindly advise how to properly build the OLAP
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> LIMIT 500": No realization found for OLAPContext,
>>> MODEL_UNMATCHED_JOIN,
>>> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS,
>>> FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
>>> 12, 13, 14, 15, 16, 17, 18, 19, 20])
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <
>>> [email protected]>
>>> >>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>> Hi,
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>     Yesterday, I tried to see if query pushdown functions
>>> work
>>> >>>>>>>>>>>>> well in the Kylin5 docker, and all of my queries return
>>> proper responses .
>>> >>>>>>>>>>>>>     After checking your logs from Shaofeng, I found these
>>> >>>>>>>>>>>>> error messages repeated many times:
>>> >>>>>>>>>>>>>     1. 'java.io.IOException: All datanodes
>>> >>>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866
>>> ,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK]
>>> >>>>>>>>>>>>> are bad. Aborting...'
>>> >>>>>>>>>>>>>     2. 'curator.ConnectionState : Connection timed out for
>>> >>>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) /
>>> elapsed (41794)
>>> >>>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException:
>>> >>>>>>>>>>>>> KeeperErrorCode = ConnectionLoss'
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>     I guess the root cause is that the container didn't not
>>> >>>>>>>>>>>>> have enough resources. I found you query on a table called
>>> >>>>>>>>>>>>> 'XXX_hive_dwh_400million_rows', looks like you gave a
>>> complex query on a
>>> >>>>>>>>>>>>> table which contains 400 million rows?
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>     Since I am the uploader of kylin5 's docker image, I
>>> want
>>> >>>>>>>>>>>>> to give some explainment. Kylin5 docker is not a place for
>>> performance
>>> >>>>>>>>>>>>> benchmarks, it is only for demonstration. It is only
>>> allocated with very
>>> >>>>>>>>>>>>> little resources(8G memory) if you are using the default
>>> command from
>>> >>>>>>>>>>>>> docker hub page. Before I uploaded my image, I only tested
>>> my image using
>>> >>>>>>>>>>>>> the ssb dataset, which the biggest table only contains
>>> about 60k rows. If
>>> >>>>>>>>>>>>> you are using a larger dataset and complexer queries, you
>>> have to scale the
>>> >>>>>>>>>>>>> resource properly. Try querying tables which contain not
>>> more than 100k
>>> >>>>>>>>>>>>> rows by default.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>     Here are some tips which may help you to check if the
>>> >>>>>>>>>>>>> daemon service is in health status and
>>> resources(particularly disk space)
>>> >>>>>>>>>>>>> is configured properly.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>     1. Checking HDFS 's web ui(
>>> >>>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to
>>> >>>>>>>>>>>>> confirm whether HDFS service is in 'In service' status.
>>> >>>>>>>>>>>>>     2. Checking Datanode 's log in
>>> >>>>>>>>>>>>>
>>> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`, check if
>>> >>>>>>>>>>>>> there is any error message. Like: cat
>>> >>>>>>>>>>>>>
>>> /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep ERROR
>>> >>>>>>>>>>>>> | wc -l
>>> >>>>>>>>>>>>>     3. Checking if your docker engine is configured with
>>> >>>>>>>>>>>>> enough disk space, if you are using Docker Desktop like
>>> me,please go to
>>> >>>>>>>>>>>>> "Settings" - "Resources" - "Advanced", make sure you have
>>> allocated 40GB+
>>> >>>>>>>>>>>>> disk space to the docker container.
>>> >>>>>>>>>>>>>     4. Checking the available disk space of your container
>>> by
>>> >>>>>>>>>>>>> `df -h`, make sure the 'Use%' of 'overlay' is less than
>>> 60% .
>>> >>>>>>>>>>>>>     5. Checking the load average/ cpu usage/ jvm gc. Make
>>> sure
>>> >>>>>>>>>>>>> these metrics are not really high when you send a query.
>>> >>>>>>>>>>>>> ------------------------
>>> >>>>>>>>>>>>> With warm regard
>>> >>>>>>>>>>>>> Xiaoxiang Yu
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy
>>> >>>>>>>>>>>>> <[email protected]> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Hi ShaoFeng
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Thank you very much for your valuable feedback
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> I saw the application to be there (if I see it right) as
>>> in
>>> >>>>>>>>>>>>>> the attachment photo. Kindly advise so that I can run
>>> this query on OLAP.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> PS. I sent you the log file in private.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> [image: image.png]
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi <
>>> >>>>>>>>>>>>>> [email protected]> wrote:
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when
>>> >>>>>>>>>>>>>>> executing the SQL? and you can also check the Spark UI
>>> from yarn resource
>>> >>>>>>>>>>>>>>> manager (there should be one running application called
>>> Spardar, which is
>>> >>>>>>>>>>>>>>> Kylin's backend spark application). If the application
>>> is not there, it may
>>> >>>>>>>>>>>>>>> indicates the yarn doesn't have resource to startup it.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Best regards,
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Shaofeng Shi 史少锋
>>> >>>>>>>>>>>>>>> Apache Kylin PMC,
>>> >>>>>>>>>>>>>>> Apache Incubator PMC,
>>> >>>>>>>>>>>>>>> Email: [email protected]
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Apache Kylin FAQ:
>>> >>>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html
>>> >>>>>>>>>>>>>>> Join Kylin user mail group:
>>> [email protected]
>>> >>>>>>>>>>>>>>> Join Kylin dev mail group:
>>> [email protected]
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Nam Đỗ Duy <[email protected]> 于2023年10月31日周二 10:35写道：
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Dear Sir/Madam,
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I have a fact with 500million rows then I build model,
>>> >>>>>>>>>>>>>>>> index according to the website help.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I chose full incremental because this is the first
>>> times I
>>> >>>>>>>>>>>>>>>> load data
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I create both index types Aggregate group index, table
>>> >>>>>>>>>>>>>>>> index as photo attached.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> But the query always failed after timeout of 300
>>> seconds (I
>>> >>>>>>>>>>>>>>>> run in docker), I dont want to increase the value of
>>> 300 seconds because I
>>> >>>>>>>>>>>>>>>> wish the OLAP can run within 1 minutes (is that
>>> possible?)
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> It seems that the OLAP function in indexing not working
>>> to
>>> >>>>>>>>>>>>>>>> speedup the query by precomputed cube.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Can you advise to check whether the index did really
>>> work?
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> It is quite urgent task for me so prompt response is
>>> highly
>>> >>>>>>>>>>>>>>>> appreciated.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Thank you very much
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>>
>>

Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Reply via email to