Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Nam Đỗ Duy Wed, 01 Nov 2023 02:58:00 -0700

Thank you Xiaoxiang, I nearly got to the point.

So can I interpret that: 1 model equal (~) to a set of Joins of (Dim/Fact)
table, that is to say we need to create several models according to
multiple kinds of joins queries?


Best regards

On Wed, Nov 1, 2023 at 4:50 PM Xiaoxiang Yu <x...@apache.org> wrote:

> Have you ever tried to analyse the reason why your query can not hit Model
> 'sample_ssb'?
> It is because the join relation of your query is not suitable for the join
> relation/pattern of  Model 'sample_ssb'.
>
> Your query used a join relation/pattern like: A inner join B.
> But the Model 'sample_ssb' used a join relation/pattern like : A inner
> join B inner join C.
>
> If you are familiar with the definition of Inner join, you may know that
> the
> relation/pattern 'A inner join B inner join C' will have a chance
> to lose some rows when compared to pattern 'A inner join B'.
> So the Model 'sample_ssb' will be excluded to serve your query.
>
> That is to say, you need to create a new model that is similar to Model
> 'sample_ssb',
>  but with additional tables removed.
>
>
>
> ------------------------
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>
>> Hi Xiaoxiang,
>>
>> Thank you very much
>>
>> I have clearer picture of Kylin already thanks to your explanation.
>>
>> Now back to the sample project of SSB in attached photo, when I run this
>> query with push_down option OFF, why the OLAP error appears, and in such
>> case, how to create a new cube for this query?
>>
>> [image: image.png]
>>
>> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>
>>> Here is some of my explanation and it may not be perfect.
>>> Segment in Kylin is part of model/cube pre-computed data, in most
>>> cases, divided by date column.
>>>
>>> Here is some difference between Segment and Snapshot.
>>> Segment, whose source data comes from one fact table joins some dimension
>>> tables with 'specific date range', is 'precomputed', and will accelerate
>>> complex query.
>>> Snapshot, whose source data comes from one specific dimension table without
>>> specific date range, is "not precomputed", and can join with segments
>>> at runtime .
>>>
>>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management
>>> -
>>> https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro
>>>
>>> ------------------------
>>> With warm regard
>>> Xiaoxiang Yu
>>>
>>>
>>>
>>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <na...@vnpay.vn> wrote:
>>>
>>>> Thank you again, very smart of you to automatically select cube for a
>>>> certain query. Sorry If I ask too much: Is the concept of Segment in Kylin
>>>> model similar to Slice-and-Dice concept of Cube, what is the different
>>>> between Kylin Segment and Kylin Snapshot?
>>>>
>>>> PS. I sent you the log files for your help in investigating why my cube
>>>> has not been used.
>>>>
>>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>>>
>>>>> I guess there is a misunderstanding from your sentences.
>>>>>
>>>>> -- 'I need to select Cube from a combo box below the query window'
>>>>> It is not right to use 'need', that combo box is for some specific
>>>>> cases(for example, Kylin did not choose a cube which is the most
>>>>> efficient), not the most cases.
>>>>> In most cases(both for Kylin 4 and Kylin 5), you don't need to select
>>>>> a Cube in the combo box, Kylin will do the choice for you.
>>>>>
>>>>> ------------------------
>>>>> With warm regard
>>>>> Xiaoxiang Yu
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <na...@vnpay.vn.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a
>>>>>> question of a beginner)
>>>>>>
>>>>>> "obviously" means "clearly"
>>>>>>
>>>>>> because I need to select Cube from a combo box below the query window
>>>>>>
>>>>>> Thank you very much
>>>>>>
>>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>>>>>
>>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very
>>>>>>> obviously'. Can you give an example?
>>>>>>> From the source code, the basic logic of choosing the right
>>>>>>> cube/model are similar.
>>>>>>> ------------------------
>>>>>>> With warm regard
>>>>>>> Xiaoxiang Yu
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <na...@vnpay.vn> wrote:
>>>>>>>
>>>>>>>> Thank you for your kind reply, please answer 1 more question about
>>>>>>>> version 5:
>>>>>>>>
>>>>>>>> In version 4.x we run query against a Cube very obviously, but in
>>>>>>>> version 5, the cube usage is a implication socan you advise: for a 
>>>>>>>> given
>>>>>>>> query, which model will be used, which index (cube) will be used for 
>>>>>>>> this
>>>>>>>> query?
>>>>>>>>
>>>>>>>> Thank you
>>>>>>>>
>>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <x...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> 1. How do I measure the size of the index (cube) in version 5?
>>>>>>>>>    You can check storage of specific Indexes from the Index page.
>>>>>>>>>
>>>>>>>>> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index
>>>>>>>>> or
>>>>>>>>> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. How to create the cardinality for each column?
>>>>>>>>>    You should check this link :
>>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ .
>>>>>>>>>
>>>>>>>>> 3. In your default project sample named SSB project, you have only
>>>>>>>>> 4 simple aggregate group index and no table index as in attached file
>>>>>>>>> so what is the best strategy to select index for our OLAP?
>>>>>>>>>     1. There does exist a 'Base Table Index'  by default
>>>>>>>>> actually, its id is 20000000001.
>>>>>>>>>     2. I think it is a good question and Kylin 5 lacks such a
>>>>>>>>> guide for better modeling. You are free to ask your question to
>>>>>>>>> mailing list and I will try to reply.
>>>>>>>>>
>>>>>>>>> ------------------------
>>>>>>>>> With warm regard
>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <x...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> OK, I didn't read all the mail history so I misunderstand the
>>>>>>>>>> situation. Looks like you need to analyse
>>>>>>>>>> the cause why the query didn't hit the cube correctly.
>>>>>>>>>>
>>>>>>>>>> Please generate query diagnosis package and send it to me
>>>>>>>>>> privately. I will analyse the query log.
>>>>>>>>>> You can refer to the following steps in screenshots.
>>>>>>>>>>
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>> If the screenshots are not displaying correctly, please read this
>>>>>>>>>> guide :
>>>>>>>>>>
>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui
>>>>>>>>>>
>>>>>>>>>> By the way, you need to analyse the cause by reading
>>>>>>>>>> kylin.query.log, not the kylin.log,
>>>>>>>>>> refer to
>>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log
>>>>>>>>>>
>>>>>>>>>> ------------------------
>>>>>>>>>> With warm regard
>>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <na...@vnpay.vn>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Xiaoxiang for your advice. As my title email shown, I
>>>>>>>>>>> guessed that the OLAP functionalities has not been correctly set up 
>>>>>>>>>>> in my
>>>>>>>>>>> computer.
>>>>>>>>>>>
>>>>>>>>>>> The evidence about it is that: when I disable the Pushdown
>>>>>>>>>>> option box to use solely the precomputation cube only, it showed 
>>>>>>>>>>> following
>>>>>>>>>>> error: Please kindly advise how to properly build the OLAP
>>>>>>>>>>>
>>>>>>>>>>> LIMIT 500": No realization found for OLAPContext, 
>>>>>>>>>>> MODEL_UNMATCHED_JOIN, 
>>>>>>>>>>> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS,
>>>>>>>>>>>  FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
>>>>>>>>>>> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <x...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>     Yesterday, I tried to see if query pushdown functions work
>>>>>>>>>>>> well in the Kylin5 docker, and all of my queries return proper 
>>>>>>>>>>>> responses .
>>>>>>>>>>>>     After checking your logs from Shaofeng, I found these error
>>>>>>>>>>>> messages repeated many times:
>>>>>>>>>>>>     1. 'java.io.IOException: All datanodes
>>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK]
>>>>>>>>>>>> are bad. Aborting...'
>>>>>>>>>>>>     2. 'curator.ConnectionState : Connection timed out for
>>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) / elapsed 
>>>>>>>>>>>> (41794)
>>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException:
>>>>>>>>>>>> KeeperErrorCode = ConnectionLoss'
>>>>>>>>>>>>
>>>>>>>>>>>>     I guess the root cause is that the container didn't not
>>>>>>>>>>>> have enough resources. I found you query on a table called
>>>>>>>>>>>> 'XXX_hive_dwh_400million_rows', looks like you gave a complex 
>>>>>>>>>>>> query on a
>>>>>>>>>>>> table which contains 400 million rows?
>>>>>>>>>>>>
>>>>>>>>>>>>     Since I am the uploader of kylin5 's docker image, I want
>>>>>>>>>>>> to give some explainment. Kylin5 docker is not a place for 
>>>>>>>>>>>> performance
>>>>>>>>>>>> benchmarks, it is only for demonstration. It is only allocated 
>>>>>>>>>>>> with very
>>>>>>>>>>>> little resources(8G memory) if you are using the default command 
>>>>>>>>>>>> from
>>>>>>>>>>>> docker hub page. Before I uploaded my image, I only tested my 
>>>>>>>>>>>> image using
>>>>>>>>>>>> the ssb dataset, which the biggest table only contains about 60k 
>>>>>>>>>>>> rows. If
>>>>>>>>>>>> you are using a larger dataset and complexer queries, you have to 
>>>>>>>>>>>> scale the
>>>>>>>>>>>> resource properly. Try querying tables which contain not more than 
>>>>>>>>>>>> 100k
>>>>>>>>>>>> rows by default.
>>>>>>>>>>>>
>>>>>>>>>>>>     Here are some tips which may help you to check if the
>>>>>>>>>>>> daemon service is in health status and resources(particularly disk 
>>>>>>>>>>>> space)
>>>>>>>>>>>> is configured properly.
>>>>>>>>>>>>
>>>>>>>>>>>>     1. Checking HDFS 's web ui(
>>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to confirm
>>>>>>>>>>>> whether HDFS service is in 'In service' status.
>>>>>>>>>>>>     2. Checking Datanode 's log in
>>>>>>>>>>>> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log`, 
>>>>>>>>>>>> check if
>>>>>>>>>>>> there is any error message. Like: cat
>>>>>>>>>>>> /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | 
>>>>>>>>>>>> grep ERROR
>>>>>>>>>>>> | wc -l
>>>>>>>>>>>>     3. Checking if your docker engine is configured with enough
>>>>>>>>>>>> disk space, if you are using Docker Desktop like me,please go to 
>>>>>>>>>>>> "Settings"
>>>>>>>>>>>> - "Resources" - "Advanced", make sure you have allocated 40GB+ 
>>>>>>>>>>>> disk space
>>>>>>>>>>>> to the docker container.
>>>>>>>>>>>>     4. Checking the available disk space of your container by
>>>>>>>>>>>> `df -h`, make sure the 'Use%' of 'overlay' is less than 60% .
>>>>>>>>>>>>     5. Checking the load average/ cpu usage/ jvm gc. Make sure
>>>>>>>>>>>> these metrics are not really high when you send a query.
>>>>>>>>>>>> ------------------------
>>>>>>>>>>>> With warm regard
>>>>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy
>>>>>>>>>>>> <na...@vnpay.vn.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi ShaoFeng
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for your valuable feedback
>>>>>>>>>>>>>
>>>>>>>>>>>>> I saw the application to be there (if I see it right) as in
>>>>>>>>>>>>> the attachment photo. Kindly advise so that I can run this query 
>>>>>>>>>>>>> on OLAP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS. I sent you the log file in private.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi <
>>>>>>>>>>>>> shaofeng...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when executing
>>>>>>>>>>>>>> the SQL? and you can also check the Spark UI from yarn resource 
>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>> (there should be one running application called Spardar, which 
>>>>>>>>>>>>>> is Kylin's
>>>>>>>>>>>>>> backend spark application). If the application is not there, it 
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>> indicates the yarn doesn't have resource to startup it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>>>>>> Apache Kylin PMC,
>>>>>>>>>>>>>> Apache Incubator PMC,
>>>>>>>>>>>>>> Email: shaofeng...@apache.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apache Kylin FAQ:
>>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>>>>>>>>>> Join Kylin user mail group: user-subscr...@kylin.apache.org
>>>>>>>>>>>>>> Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nam Đỗ Duy <na...@vnpay.vn> 于2023年10月31日周二 10:35写道：
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Sir/Madam,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a fact with 500million rows then I build model, index
>>>>>>>>>>>>>>> according to the website help.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I chose full incremental because this is the first times I
>>>>>>>>>>>>>>> load data
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I create both index types Aggregate group index, table index
>>>>>>>>>>>>>>> as photo attached.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But the query always failed after timeout of 300 seconds (I
>>>>>>>>>>>>>>> run in docker), I dont want to increase the value of 300 
>>>>>>>>>>>>>>> seconds because I
>>>>>>>>>>>>>>> wish the OLAP can run within 1 minutes (is that possible?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It seems that the OLAP function in indexing not working to
>>>>>>>>>>>>>>> speedup the query by precomputed cube.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you advise to check whether the index did really work?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is quite urgent task for me so prompt response is highly
>>>>>>>>>>>>>>> appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you very much
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Reply via email to