Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Nam Đỗ Duy Wed, 01 Nov 2023 19:50:55 -0700

The query is too fast, less than a second, can you make it a little bit
slower so that I can see it clearly 😀😀
[image: image.png]


On Thu, Nov 2, 2023 at 9:32 AM Nam Đỗ Duy <[email protected]> wrote:

> Thank you Xiaoxiang for the guideline. Will definitely read it carefully.
> Kindly help the following questions:
>
> 1. Computed column
>
> I created a “computed column” and add it to dimensions (among other
> dimensions)
>
> When I use query to select the computed column it returned error
>
> 2. Datatype optimization: will you think that the int be better than
> string for key join columns?
>
> Please advise
>
>
> On Wed, 1 Nov 2023 at 17:32 Xiaoxiang Yu <[email protected]> wrote:
>
>> Yes, that is almost correct.
>>
>> If you have a lot of complex queries, and you want to using Kylin 5 to
>> accelerate them, the recommended steps of mine are as follows:
>>
>> 1. You analyse all queries and collect all join relation/pattern.
>> 2. You create Models for each specific join relation/pattern, with the
>> join
>> relation you find in above step.
>> 3. You analyse and collect dimensions and measures from all queries, and
>> add them to the corresponding Model.
>> 4. You build segments of all Models with proper data range.
>> 5. You turned off the pushdown switch, and sent all queries to Kylin. If
>> there are some queries which failed, fix them.
>>     Here are some common situations.
>>     5.1 Join relation/pattern is not matched
>>     5.2 If the join relation is matched, the Model might not contain every
>> column that your query needs, please check kylin.query.log with keyword '
>> unmatched'.
>> 6. (Optional) If you find some of your queries do not exactly match with
>> your Index(your query on [colA, colB], but your index contains more
>> columns
>> than colA and colB), you can add some aggregate groups(or smaller Table
>> Index) to optimize the query performance.
>>
>>
>>
>> ------------------------
>> With warm regard
>> Xiaoxiang Yu
>>
>>
>>
>> On Wed, Nov 1, 2023 at 5:57 PM Nam Đỗ Duy <[email protected]> wrote:
>>
>> > Thank you Xiaoxiang, I nearly got to the point.
>> >
>> > So can I interpret that: 1 model equal (~) to a set of Joins of
>> (Dim/Fact)
>> > table, that is to say we need to create several models according to
>> > multiple kinds of joins queries?
>> >
>> > Best regards
>> >
>> > On Wed, Nov 1, 2023 at 4:50 PM Xiaoxiang Yu <[email protected]> wrote:
>> >
>> >> Have you ever tried to analyse the reason why your query can not hit
>> >> Model 'sample_ssb'?
>> >> It is because the join relation of your query is not suitable for the
>> >> join relation/pattern of  Model 'sample_ssb'.
>> >>
>> >> Your query used a join relation/pattern like: A inner join B.
>> >> But the Model 'sample_ssb' used a join relation/pattern like : A inner
>> >> join B inner join C.
>> >>
>> >> If you are familiar with the definition of Inner join, you may know
>> that
>> >> the
>> >> relation/pattern 'A inner join B inner join C' will have a chance
>> >> to lose some rows when compared to pattern 'A inner join B'.
>> >> So the Model 'sample_ssb' will be excluded to serve your query.
>> >>
>> >> That is to say, you need to create a new model that is similar to Model
>> >> 'sample_ssb',
>> >>  but with additional tables removed.
>> >>
>> >>
>> >>
>> >> ------------------------
>> >> With warm regard
>> >> Xiaoxiang Yu
>> >>
>> >>
>> >>
>> >> On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <[email protected]>
>> wrote:
>> >>
>> >>> Hi Xiaoxiang,
>> >>>
>> >>> Thank you very much
>> >>>
>> >>> I have clearer picture of Kylin already thanks to your explanation.
>> >>>
>> >>> Now back to the sample project of SSB in attached photo, when I run
>> this
>> >>> query with push_down option OFF, why the OLAP error appears, and in
>> such
>> >>> case, how to create a new cube for this query?
>> >>>
>> >>> [image: image.png]
>> >>>
>> >>> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <[email protected]> wrote:
>> >>>
>> >>>> Here is some of my explanation and it may not be perfect.
>> >>>> Segment in Kylin is part of model/cube pre-computed data, in most
>> >>>> cases, divided by date column.
>> >>>>
>> >>>> Here is some difference between Segment and Snapshot.
>> >>>> Segment, whose source data comes from one fact table joins some
>> dimension
>> >>>> tables with 'specific date range', is 'precomputed', and will
>> accelerate
>> >>>> complex query.
>> >>>> Snapshot, whose source data comes from one specific dimension table
>> without
>> >>>> specific date range, is "not precomputed", and can join with segments
>> >>>> at runtime .
>> >>>>
>> >>>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management
>> >>>> -
>> >>>>
>> https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro
>> >>>>
>> >>>> ------------------------
>> >>>> With warm regard
>> >>>> Xiaoxiang Yu
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <[email protected]> wrote:
>> >>>>
>> >>>>> Thank you again, very smart of you to automatically select cube for
>> a
>> >>>>> certain query. Sorry If I ask too much: Is the concept of Segment
>> in Kylin
>> >>>>> model similar to Slice-and-Dice concept of Cube, what is the
>> different
>> >>>>> between Kylin Segment and Kylin Snapshot?
>> >>>>>
>> >>>>> PS. I sent you the log files for your help in investigating why my
>> >>>>> cube has not been used.
>> >>>>>
>> >>>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <[email protected]>
>> wrote:
>> >>>>>
>> >>>>>> I guess there is a misunderstanding from your sentences.
>> >>>>>>
>> >>>>>> -- 'I need to select Cube from a combo box below the query window'
>> >>>>>> It is not right to use 'need', that combo box is for some specific
>> >>>>>> cases(for example, Kylin did not choose a cube which is the most
>> >>>>>> efficient), not the most cases.
>> >>>>>> In most cases(both for Kylin 4 and Kylin 5), you don't need to
>> select
>> >>>>>> a Cube in the combo box, Kylin will do the choice for you.
>> >>>>>>
>> >>>>>> ------------------------
>> >>>>>> With warm regard
>> >>>>>> Xiaoxiang Yu
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <[email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a
>> >>>>>>> question of a beginner)
>> >>>>>>>
>> >>>>>>> "obviously" means "clearly"
>> >>>>>>>
>> >>>>>>> because I need to select Cube from a combo box below the query
>> window
>> >>>>>>>
>> >>>>>>> Thank you very much
>> >>>>>>>
>> >>>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <[email protected]>
>> wrote:
>> >>>>>>>
>> >>>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very
>> >>>>>>>> obviously'. Can you give an example?
>> >>>>>>>> From the source code, the basic logic of choosing the right
>> >>>>>>>> cube/model are similar.
>> >>>>>>>> ------------------------
>> >>>>>>>> With warm regard
>> >>>>>>>> Xiaoxiang Yu
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <[email protected]>
>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Thank you for your kind reply, please answer 1 more question
>> about
>> >>>>>>>>> version 5:
>> >>>>>>>>>
>> >>>>>>>>> In version 4.x we run query against a Cube very obviously, but
>> in
>> >>>>>>>>> version 5, the cube usage is a implication socan you advise:
>> for a given
>> >>>>>>>>> query, which model will be used, which index (cube) will be
>> used for this
>> >>>>>>>>> query?
>> >>>>>>>>>
>> >>>>>>>>> Thank you
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <[email protected]>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> 1. How do I measure the size of the index (cube) in version 5?
>> >>>>>>>>>>    You can check storage of specific Indexes from the Index
>> page.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index
>> >>>>>>>>>> or
>> >>>>>>>>>>
>> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 2. How to create the cardinality for each column?
>> >>>>>>>>>>    You should check this link :
>> >>>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ .
>> >>>>>>>>>>
>> >>>>>>>>>> 3. In your default project sample named SSB project, you have
>> >>>>>>>>>> only 4 simple aggregate group index and no table index as in
>> attached file
>> >>>>>>>>>> so what is the best strategy to select index for our OLAP?
>> >>>>>>>>>>     1. There does exist a 'Base Table Index'  by default
>> >>>>>>>>>> actually, its id is 20000000001.
>> >>>>>>>>>>     2. I think it is a good question and Kylin 5 lacks such a
>> >>>>>>>>>> guide for better modeling. You are free to ask your question to
>> >>>>>>>>>> mailing list and I will try to reply.
>> >>>>>>>>>>
>> >>>>>>>>>> ------------------------
>> >>>>>>>>>> With warm regard
>> >>>>>>>>>> Xiaoxiang Yu
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <[email protected]>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> OK, I didn't read all the mail history so I misunderstand the
>> >>>>>>>>>>> situation. Looks like you need to analyse
>> >>>>>>>>>>> the cause why the query didn't hit the cube correctly.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Please generate query diagnosis package and send it to me
>> >>>>>>>>>>> privately. I will analyse the query log.
>> >>>>>>>>>>> You can refer to the following steps in screenshots.
>> >>>>>>>>>>>
>> >>>>>>>>>>> [image: image.png]
>> >>>>>>>>>>>
>> >>>>>>>>>>> If the screenshots are not displaying correctly, please read
>> >>>>>>>>>>> this guide :
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui
>> >>>>>>>>>>>
>> >>>>>>>>>>> By the way, you need to analyse the cause by reading
>> >>>>>>>>>>> kylin.query.log, not the kylin.log,
>> >>>>>>>>>>> refer to
>> >>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log
>> >>>>>>>>>>>
>> >>>>>>>>>>> ------------------------
>> >>>>>>>>>>> With warm regard
>> >>>>>>>>>>> Xiaoxiang Yu
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <[email protected]>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Thank you Xiaoxiang for your advice. As my title email
>> shown, I
>> >>>>>>>>>>>> guessed that the OLAP functionalities has not been correctly
>> set up in my
>> >>>>>>>>>>>> computer.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> The evidence about it is that: when I disable the Pushdown
>> >>>>>>>>>>>> option box to use solely the precomputation cube only, it
>> showed following
>> >>>>>>>>>>>> error: Please kindly advise how to properly build the OLAP
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> LIMIT 500": No realization found for OLAPContext,
>> MODEL_UNMATCHED_JOIN,
>> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS,
>> FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
>> 12, 13, 14, 15, 16, 17, 18, 19, 20])
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <
>> [email protected]>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>     Yesterday, I tried to see if query pushdown functions
>> work
>> >>>>>>>>>>>>> well in the Kylin5 docker, and all of my queries return
>> proper responses .
>> >>>>>>>>>>>>>     After checking your logs from Shaofeng, I found these
>> >>>>>>>>>>>>> error messages repeated many times:
>> >>>>>>>>>>>>>     1. 'java.io.IOException: All datanodes
>> >>>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866
>> ,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK]
>> >>>>>>>>>>>>> are bad. Aborting...'
>> >>>>>>>>>>>>>     2. 'curator.ConnectionState : Connection timed out for
>> >>>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) /
>> elapsed (41794)
>> >>>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException:
>> >>>>>>>>>>>>> KeeperErrorCode = ConnectionLoss'
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>     I guess the root cause is that the container didn't not
>> >>>>>>>>>>>>> have enough resources. I found you query on a table called
>> >>>>>>>>>>>>> 'XXX_hive_dwh_400million_rows', looks like you gave a
>> complex query on a
>> >>>>>>>>>>>>> table which contains 400 million rows?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>     Since I am the uploader of kylin5 's docker image, I
>> want
>> >>>>>>>>>>>>> to give some explainment. Kylin5 docker is not a place for
>> performance
>> >>>>>>>>>>>>> benchmarks, it is only for demonstration. It is only
>> allocated with very
>> >>>>>>>>>>>>> little resources(8G memory) if you are using the default
>> command from
>> >>>>>>>>>>>>> docker hub page. Before I uploaded my image, I only tested
>> my image using
>> >>>>>>>>>>>>> the ssb dataset, which the biggest table only contains
>> about 60k rows. If
>> >>>>>>>>>>>>> you are using a larger dataset and complexer queries, you
>> have to scale the
>> >>>>>>>>>>>>> resource properly. Try querying tables which contain not
>> more than 100k
>> >>>>>>>>>>>>> rows by default.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>     Here are some tips which may help you to check if the
>> >>>>>>>>>>>>> daemon service is in health status and
>> resources(particularly disk space)
>> >>>>>>>>>>>>> is configured properly.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>     1. Checking HDFS 's web ui(
>> >>>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to
>> >>>>>>>>>>>>> confirm whether HDFS service is in 'In service' status.
>> >>>>>>>>>>>>>     2. Checking Datanode 's log in
>> >>>>>>>>>>>>>
>> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`, check if
>> >>>>>>>>>>>>> there is any error message. Like: cat
>> >>>>>>>>>>>>>
>> /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep ERROR
>> >>>>>>>>>>>>> | wc -l
>> >>>>>>>>>>>>>     3. Checking if your docker engine is configured with
>> >>>>>>>>>>>>> enough disk space, if you are using Docker Desktop like
>> me,please go to
>> >>>>>>>>>>>>> "Settings" - "Resources" - "Advanced", make sure you have
>> allocated 40GB+
>> >>>>>>>>>>>>> disk space to the docker container.
>> >>>>>>>>>>>>>     4. Checking the available disk space of your container
>> by
>> >>>>>>>>>>>>> `df -h`, make sure the 'Use%' of 'overlay' is less than 60%
>> .
>> >>>>>>>>>>>>>     5. Checking the load average/ cpu usage/ jvm gc. Make
>> sure
>> >>>>>>>>>>>>> these metrics are not really high when you send a query.
>> >>>>>>>>>>>>> ------------------------
>> >>>>>>>>>>>>> With warm regard
>> >>>>>>>>>>>>> Xiaoxiang Yu
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy
>> >>>>>>>>>>>>> <[email protected]> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi ShaoFeng
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thank you very much for your valuable feedback
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I saw the application to be there (if I see it right) as in
>> >>>>>>>>>>>>>> the attachment photo. Kindly advise so that I can run this
>> query on OLAP.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> PS. I sent you the log file in private.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> [image: image.png]
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi <
>> >>>>>>>>>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when
>> >>>>>>>>>>>>>>> executing the SQL? and you can also check the Spark UI
>> from yarn resource
>> >>>>>>>>>>>>>>> manager (there should be one running application called
>> Spardar, which is
>> >>>>>>>>>>>>>>> Kylin's backend spark application). If the application is
>> not there, it may
>> >>>>>>>>>>>>>>> indicates the yarn doesn't have resource to startup it.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Shaofeng Shi 史少锋
>> >>>>>>>>>>>>>>> Apache Kylin PMC,
>> >>>>>>>>>>>>>>> Apache Incubator PMC,
>> >>>>>>>>>>>>>>> Email: [email protected]
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Apache Kylin FAQ:
>> >>>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html
>> >>>>>>>>>>>>>>> Join Kylin user mail group:
>> [email protected]
>> >>>>>>>>>>>>>>> Join Kylin dev mail group: [email protected]
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Nam Đỗ Duy <[email protected]> 于2023年10月31日周二 10:35写道：
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Dear Sir/Madam,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I have a fact with 500million rows then I build model,
>> >>>>>>>>>>>>>>>> index according to the website help.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I chose full incremental because this is the first times
>> I
>> >>>>>>>>>>>>>>>> load data
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I create both index types Aggregate group index, table
>> >>>>>>>>>>>>>>>> index as photo attached.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> But the query always failed after timeout of 300 seconds
>> (I
>> >>>>>>>>>>>>>>>> run in docker), I dont want to increase the value of 300
>> seconds because I
>> >>>>>>>>>>>>>>>> wish the OLAP can run within 1 minutes (is that
>> possible?)
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> It seems that the OLAP function in indexing not working
>> to
>> >>>>>>>>>>>>>>>> speedup the query by precomputed cube.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Can you advise to check whether the index did really
>> work?
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> It is quite urgent task for me so prompt response is
>> highly
>> >>>>>>>>>>>>>>>> appreciated.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thank you very much
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>>
>

Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Reply via email to