Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Nam Đỗ Duy Wed, 01 Nov 2023 19:33:17 -0700

Thank you Xiaoxiang for the guideline. Will definitely read it carefully.
Kindly help the following questions:


1. Computed column

I created a “computed column” and add it to dimensions (among other
dimensions)

When I use query to select the computed column it returned error

2. Datatype optimization: will you think that the int be better than string
for key join columns?

Please advise


On Wed, 1 Nov 2023 at 17:32 Xiaoxiang Yu <[email protected]> wrote:

> Yes, that is almost correct.
>
> If you have a lot of complex queries, and you want to using Kylin 5 to
> accelerate them, the recommended steps of mine are as follows:
>
> 1. You analyse all queries and collect all join relation/pattern.
> 2. You create Models for each specific join relation/pattern, with the join
> relation you find in above step.
> 3. You analyse and collect dimensions and measures from all queries, and
> add them to the corresponding Model.
> 4. You build segments of all Models with proper data range.
> 5. You turned off the pushdown switch, and sent all queries to Kylin. If
> there are some queries which failed, fix them.
>     Here are some common situations.
>     5.1 Join relation/pattern is not matched
>     5.2 If the join relation is matched, the Model might not contain every
> column that your query needs, please check kylin.query.log with keyword '
> unmatched'.
> 6. (Optional) If you find some of your queries do not exactly match with
> your Index(your query on [colA, colB], but your index contains more columns
> than colA and colB), you can add some aggregate groups(or smaller Table
> Index) to optimize the query performance.
>
>
>
> ------------------------
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Wed, Nov 1, 2023 at 5:57 PM Nam Đỗ Duy <[email protected]> wrote:
>
> > Thank you Xiaoxiang, I nearly got to the point.
> >
> > So can I interpret that: 1 model equal (~) to a set of Joins of
> (Dim/Fact)
> > table, that is to say we need to create several models according to
> > multiple kinds of joins queries?
> >
> > Best regards
> >
> > On Wed, Nov 1, 2023 at 4:50 PM Xiaoxiang Yu <[email protected]> wrote:
> >
> >> Have you ever tried to analyse the reason why your query can not hit
> >> Model 'sample_ssb'?
> >> It is because the join relation of your query is not suitable for the
> >> join relation/pattern of  Model 'sample_ssb'.
> >>
> >> Your query used a join relation/pattern like: A inner join B.
> >> But the Model 'sample_ssb' used a join relation/pattern like : A inner
> >> join B inner join C.
> >>
> >> If you are familiar with the definition of Inner join, you may know that
> >> the
> >> relation/pattern 'A inner join B inner join C' will have a chance
> >> to lose some rows when compared to pattern 'A inner join B'.
> >> So the Model 'sample_ssb' will be excluded to serve your query.
> >>
> >> That is to say, you need to create a new model that is similar to Model
> >> 'sample_ssb',
> >>  but with additional tables removed.
> >>
> >>
> >>
> >> ------------------------
> >> With warm regard
> >> Xiaoxiang Yu
> >>
> >>
> >>
> >> On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <[email protected]>
> wrote:
> >>
> >>> Hi Xiaoxiang,
> >>>
> >>> Thank you very much
> >>>
> >>> I have clearer picture of Kylin already thanks to your explanation.
> >>>
> >>> Now back to the sample project of SSB in attached photo, when I run
> this
> >>> query with push_down option OFF, why the OLAP error appears, and in
> such
> >>> case, how to create a new cube for this query?
> >>>
> >>> [image: image.png]
> >>>
> >>> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <[email protected]> wrote:
> >>>
> >>>> Here is some of my explanation and it may not be perfect.
> >>>> Segment in Kylin is part of model/cube pre-computed data, in most
> >>>> cases, divided by date column.
> >>>>
> >>>> Here is some difference between Segment and Snapshot.
> >>>> Segment, whose source data comes from one fact table joins some
> dimension
> >>>> tables with 'specific date range', is 'precomputed', and will
> accelerate
> >>>> complex query.
> >>>> Snapshot, whose source data comes from one specific dimension table
> without
> >>>> specific date range, is "not precomputed", and can join with segments
> >>>> at runtime .
> >>>>
> >>>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management
> >>>> -
> >>>>
> https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro
> >>>>
> >>>> ------------------------
> >>>> With warm regard
> >>>> Xiaoxiang Yu
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <[email protected]> wrote:
> >>>>
> >>>>> Thank you again, very smart of you to automatically select cube for a
> >>>>> certain query. Sorry If I ask too much: Is the concept of Segment in
> Kylin
> >>>>> model similar to Slice-and-Dice concept of Cube, what is the
> different
> >>>>> between Kylin Segment and Kylin Snapshot?
> >>>>>
> >>>>> PS. I sent you the log files for your help in investigating why my
> >>>>> cube has not been used.
> >>>>>
> >>>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <[email protected]> wrote:
> >>>>>
> >>>>>> I guess there is a misunderstanding from your sentences.
> >>>>>>
> >>>>>> -- 'I need to select Cube from a combo box below the query window'
> >>>>>> It is not right to use 'need', that combo box is for some specific
> >>>>>> cases(for example, Kylin did not choose a cube which is the most
> >>>>>> efficient), not the most cases.
> >>>>>> In most cases(both for Kylin 4 and Kylin 5), you don't need to
> select
> >>>>>> a Cube in the combo box, Kylin will do the choice for you.
> >>>>>>
> >>>>>> ------------------------
> >>>>>> With warm regard
> >>>>>> Xiaoxiang Yu
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a
> >>>>>>> question of a beginner)
> >>>>>>>
> >>>>>>> "obviously" means "clearly"
> >>>>>>>
> >>>>>>> because I need to select Cube from a combo box below the query
> window
> >>>>>>>
> >>>>>>> Thank you very much
> >>>>>>>
> >>>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <[email protected]>
> wrote:
> >>>>>>>
> >>>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very
> >>>>>>>> obviously'. Can you give an example?
> >>>>>>>> From the source code, the basic logic of choosing the right
> >>>>>>>> cube/model are similar.
> >>>>>>>> ------------------------
> >>>>>>>> With warm regard
> >>>>>>>> Xiaoxiang Yu
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Thank you for your kind reply, please answer 1 more question
> about
> >>>>>>>>> version 5:
> >>>>>>>>>
> >>>>>>>>> In version 4.x we run query against a Cube very obviously, but in
> >>>>>>>>> version 5, the cube usage is a implication socan you advise: for
> a given
> >>>>>>>>> query, which model will be used, which index (cube) will be used
> for this
> >>>>>>>>> query?
> >>>>>>>>>
> >>>>>>>>> Thank you
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> 1. How do I measure the size of the index (cube) in version 5?
> >>>>>>>>>>    You can check storage of specific Indexes from the Index
> page.
> >>>>>>>>>>
> >>>>>>>>>>
> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index
> >>>>>>>>>> or
> >>>>>>>>>>
> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2. How to create the cardinality for each column?
> >>>>>>>>>>    You should check this link :
> >>>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ .
> >>>>>>>>>>
> >>>>>>>>>> 3. In your default project sample named SSB project, you have
> >>>>>>>>>> only 4 simple aggregate group index and no table index as in
> attached file
> >>>>>>>>>> so what is the best strategy to select index for our OLAP?
> >>>>>>>>>>     1. There does exist a 'Base Table Index'  by default
> >>>>>>>>>> actually, its id is 20000000001.
> >>>>>>>>>>     2. I think it is a good question and Kylin 5 lacks such a
> >>>>>>>>>> guide for better modeling. You are free to ask your question to
> >>>>>>>>>> mailing list and I will try to reply.
> >>>>>>>>>>
> >>>>>>>>>> ------------------------
> >>>>>>>>>> With warm regard
> >>>>>>>>>> Xiaoxiang Yu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> OK, I didn't read all the mail history so I misunderstand the
> >>>>>>>>>>> situation. Looks like you need to analyse
> >>>>>>>>>>> the cause why the query didn't hit the cube correctly.
> >>>>>>>>>>>
> >>>>>>>>>>> Please generate query diagnosis package and send it to me
> >>>>>>>>>>> privately. I will analyse the query log.
> >>>>>>>>>>> You can refer to the following steps in screenshots.
> >>>>>>>>>>>
> >>>>>>>>>>> [image: image.png]
> >>>>>>>>>>>
> >>>>>>>>>>> If the screenshots are not displaying correctly, please read
> >>>>>>>>>>> this guide :
> >>>>>>>>>>>
> >>>>>>>>>>>
> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui
> >>>>>>>>>>>
> >>>>>>>>>>> By the way, you need to analyse the cause by reading
> >>>>>>>>>>> kylin.query.log, not the kylin.log,
> >>>>>>>>>>> refer to
> >>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log
> >>>>>>>>>>>
> >>>>>>>>>>> ------------------------
> >>>>>>>>>>> With warm regard
> >>>>>>>>>>> Xiaoxiang Yu
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <[email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thank you Xiaoxiang for your advice. As my title email shown,
> I
> >>>>>>>>>>>> guessed that the OLAP functionalities has not been correctly
> set up in my
> >>>>>>>>>>>> computer.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The evidence about it is that: when I disable the Pushdown
> >>>>>>>>>>>> option box to use solely the precomputation cube only, it
> showed following
> >>>>>>>>>>>> error: Please kindly advise how to properly build the OLAP
> >>>>>>>>>>>>
> >>>>>>>>>>>> LIMIT 500": No realization found for OLAPContext,
> MODEL_UNMATCHED_JOIN,
> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS,
> FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
> 12, 13, 14, 15, 16, 17, 18, 19, 20])
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <[email protected]
> >
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     Yesterday, I tried to see if query pushdown functions
> work
> >>>>>>>>>>>>> well in the Kylin5 docker, and all of my queries return
> proper responses .
> >>>>>>>>>>>>>     After checking your logs from Shaofeng, I found these
> >>>>>>>>>>>>> error messages repeated many times:
> >>>>>>>>>>>>>     1. 'java.io.IOException: All datanodes
> >>>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866
> ,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK]
> >>>>>>>>>>>>> are bad. Aborting...'
> >>>>>>>>>>>>>     2. 'curator.ConnectionState : Connection timed out for
> >>>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) /
> elapsed (41794)
> >>>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException:
> >>>>>>>>>>>>> KeeperErrorCode = ConnectionLoss'
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     I guess the root cause is that the container didn't not
> >>>>>>>>>>>>> have enough resources. I found you query on a table called
> >>>>>>>>>>>>> 'XXX_hive_dwh_400million_rows', looks like you gave a
> complex query on a
> >>>>>>>>>>>>> table which contains 400 million rows?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     Since I am the uploader of kylin5 's docker image, I want
> >>>>>>>>>>>>> to give some explainment. Kylin5 docker is not a place for
> performance
> >>>>>>>>>>>>> benchmarks, it is only for demonstration. It is only
> allocated with very
> >>>>>>>>>>>>> little resources(8G memory) if you are using the default
> command from
> >>>>>>>>>>>>> docker hub page. Before I uploaded my image, I only tested
> my image using
> >>>>>>>>>>>>> the ssb dataset, which the biggest table only contains about
> 60k rows. If
> >>>>>>>>>>>>> you are using a larger dataset and complexer queries, you
> have to scale the
> >>>>>>>>>>>>> resource properly. Try querying tables which contain not
> more than 100k
> >>>>>>>>>>>>> rows by default.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     Here are some tips which may help you to check if the
> >>>>>>>>>>>>> daemon service is in health status and
> resources(particularly disk space)
> >>>>>>>>>>>>> is configured properly.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     1. Checking HDFS 's web ui(
> >>>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to
> >>>>>>>>>>>>> confirm whether HDFS service is in 'In service' status.
> >>>>>>>>>>>>>     2. Checking Datanode 's log in
> >>>>>>>>>>>>>
> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`, check if
> >>>>>>>>>>>>> there is any error message. Like: cat
> >>>>>>>>>>>>>
> /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep ERROR
> >>>>>>>>>>>>> | wc -l
> >>>>>>>>>>>>>     3. Checking if your docker engine is configured with
> >>>>>>>>>>>>> enough disk space, if you are using Docker Desktop like
> me,please go to
> >>>>>>>>>>>>> "Settings" - "Resources" - "Advanced", make sure you have
> allocated 40GB+
> >>>>>>>>>>>>> disk space to the docker container.
> >>>>>>>>>>>>>     4. Checking the available disk space of your container by
> >>>>>>>>>>>>> `df -h`, make sure the 'Use%' of 'overlay' is less than 60% .
> >>>>>>>>>>>>>     5. Checking the load average/ cpu usage/ jvm gc. Make
> sure
> >>>>>>>>>>>>> these metrics are not really high when you send a query.
> >>>>>>>>>>>>> ------------------------
> >>>>>>>>>>>>> With warm regard
> >>>>>>>>>>>>> Xiaoxiang Yu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy
> >>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi ShaoFeng
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thank you very much for your valuable feedback
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I saw the application to be there (if I see it right) as in
> >>>>>>>>>>>>>> the attachment photo. Kindly advise so that I can run this
> query on OLAP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> PS. I sent you the log file in private.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [image: image.png]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi <
> >>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when
> >>>>>>>>>>>>>>> executing the SQL? and you can also check the Spark UI
> from yarn resource
> >>>>>>>>>>>>>>> manager (there should be one running application called
> Spardar, which is
> >>>>>>>>>>>>>>> Kylin's backend spark application). If the application is
> not there, it may
> >>>>>>>>>>>>>>> indicates the yarn doesn't have resource to startup it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Shaofeng Shi 史少锋
> >>>>>>>>>>>>>>> Apache Kylin PMC,
> >>>>>>>>>>>>>>> Apache Incubator PMC,
> >>>>>>>>>>>>>>> Email: [email protected]
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Apache Kylin FAQ:
> >>>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html
> >>>>>>>>>>>>>>> Join Kylin user mail group:
> [email protected]
> >>>>>>>>>>>>>>> Join Kylin dev mail group: [email protected]
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Nam Đỗ Duy <[email protected]> 于2023年10月31日周二 10:35写道：
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Dear Sir/Madam,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have a fact with 500million rows then I build model,
> >>>>>>>>>>>>>>>> index according to the website help.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I chose full incremental because this is the first times I
> >>>>>>>>>>>>>>>> load data
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I create both index types Aggregate group index, table
> >>>>>>>>>>>>>>>> index as photo attached.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> But the query always failed after timeout of 300 seconds
> (I
> >>>>>>>>>>>>>>>> run in docker), I dont want to increase the value of 300
> seconds because I
> >>>>>>>>>>>>>>>> wish the OLAP can run within 1 minutes (is that possible?)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It seems that the OLAP function in indexing not working to
> >>>>>>>>>>>>>>>> speedup the query by precomputed cube.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Can you advise to check whether the index did really work?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It is quite urgent task for me so prompt response is
> highly
> >>>>>>>>>>>>>>>> appreciated.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thank you very much
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
>

Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Reply via email to