Re: OLAP functionalities in Kylin 5.0 seems not yet working for me

Xiaoxiang Yu Wed, 01 Nov 2023 02:50:45 -0700

Have you tried to analyse why your query cannot hit the Model
'sample_ssb'?
It is because the join relation of your query does not match the join
relation/pattern of the Model 'sample_ssb'.


Your query uses a join relation/pattern like: A inner join B.
But the Model 'sample_ssb' uses a join relation/pattern like: A inner join
B inner join C.

If you are familiar with the definition of an inner join, you may know
that the relation/pattern 'A inner join B inner join C' can lose some rows
compared to the pattern 'A inner join B', because rows with no match in C
are dropped.
So the Model 'sample_ssb' is excluded from serving your query.

That is to say, you need to create a new model that is similar to the Model
'sample_ssb', but with the extra tables removed.
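
To make the inner-join point concrete, here is a minimal Python sketch. It
does not use any Kylin API: the tables `fact`, `customer` and `part`, their
rows, and the helper `inner_join` are all invented for illustration. It only
shows that adding a third inner-joined table can drop rows that the two-table
join keeps, which is why a model built on the three-table join cannot safely
answer a two-table query.

```python
# Hypothetical toy tables; names and rows are invented for illustration only.
fact = [  # A: the fact table
    {"order_id": 1, "customer_id": 10, "part_id": 100},
    {"order_id": 2, "customer_id": 11, "part_id": 101},
    {"order_id": 3, "customer_id": 12, "part_id": 999},  # no matching part
]
customer = [{"customer_id": 10}, {"customer_id": 11}, {"customer_id": 12}]  # B
part = [{"part_id": 100}, {"part_id": 101}]  # C


def inner_join(left, right, key):
    """Merge every pair of rows from left and right that agree on key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# A inner join B: every fact row has a matching customer.
a_b = inner_join(fact, customer, "customer_id")

# A inner join B inner join C: order 3 has no matching part and disappears.
a_b_c = inner_join(a_b, part, "part_id")

print("A join B rows:       ", len(a_b))
print("A join B join C rows:", len(a_b_c))
```

In this toy data the two-table join keeps order 3 while the three-table join
drops it, so results precomputed on the three-table pattern would be missing
a row that the two-table query expects.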



------------------------
With warm regard
Xiaoxiang Yu



On Wed, Nov 1, 2023 at 5:21 PM Nam Đỗ Duy <[email protected]> wrote:

> Hi Xiaoxiang,
>
> Thank you very much
>
> I have clearer picture of Kylin already thanks to your explanation.
>
> Now back to the SSB sample project in the attached photo: when I run this
> query with the push_down option OFF, why does the OLAP error appear, and in
> that case, how do I create a new cube for this query?
>
> [image: image.png]
>
> On Wed, Nov 1, 2023 at 3:49 PM Xiaoxiang Yu <[email protected]> wrote:
>
>> Here is my explanation; it may not be perfect.
>> A Segment in Kylin is a part of the model/cube pre-computed data, in most
>> cases divided by a date column.
>>
>> Here are some differences between Segment and Snapshot.
>> A Segment, whose source data comes from one fact table joined with some
>> dimension tables over a 'specific date range', is 'precomputed' and will
>> accelerate complex queries.
>> A Snapshot, whose source data comes from one specific dimension table
>> without a specific date range, is "not precomputed" and can be joined with
>> segments at runtime.
>>
>> - https://kylin.apache.org/5.0/docs/snapshot/snapshot_management
>> - https://kylin.apache.org/5.0/docs/modeling/load_data/segment_operation_settings/intro
>>
>> ------------------------
>> With warm regard
>> Xiaoxiang Yu
>>
>>
>>
>> On Wed, Nov 1, 2023 at 3:53 PM Nam Đỗ Duy <[email protected]> wrote:
>>
>>> Thank you again, it is very smart of you to automatically select the cube
>>> for a certain query. Sorry if I ask too much: is the concept of a Segment
>>> in a Kylin model similar to the Slice-and-Dice concept of a Cube, and what
>>> is the difference between a Kylin Segment and a Kylin Snapshot?
>>>
>>> PS. I sent you the log files for your help in investigating why my cube
>>> has not been used.
>>>
>>> On Wed, Nov 1, 2023 at 2:36 PM Xiaoxiang Yu <[email protected]> wrote:
>>>
>>>> I guess there is a misunderstanding in your message.
>>>>
>>>> -- 'I need to select Cube from a combo box below the query window'
>>>> 'Need' is not the right word: that combo box is for some specific
>>>> cases (for example, when Kylin did not choose the most efficient cube),
>>>> not for most cases.
>>>> In most cases (both for Kylin 4 and Kylin 5), you don't need to select a
>>>> Cube in the combo box; Kylin will make the choice for you.
>>>>
>>>> ------------------------
>>>> With warm regard
>>>> Xiaoxiang Yu
>>>>
>>>>
>>>>
>>>> On Wed, Nov 1, 2023 at 3:24 PM Nam Đỗ Duy <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Xiaoxiang, sorry if I made you confused (Anyway, it is just a
>>>>> question of a beginner)
>>>>>
>>>>> "obviously" means "clearly"
>>>>>
>>>>> because I need to select Cube from a combo box below the query window
>>>>>
>>>>> Thank you very much
>>>>>
>>>>> On Wed, Nov 1, 2023 at 2:20 PM Xiaoxiang Yu <[email protected]> wrote:
>>>>>
>>>>>> From my side, I cannot understand why you say Kylin 4 is 'very
>>>>>> obviously'. Can you give an example?
>>>>>> From the source code, the basic logic for choosing the right
>>>>>> cube/model is similar in both versions.
>>>>>> ------------------------
>>>>>> With warm regard
>>>>>> Xiaoxiang Yu
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 1, 2023 at 3:01 PM Nam Đỗ Duy <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you for your kind reply, please answer 1 more question about
>>>>>>> version 5:
>>>>>>>
>>>>>>> In version 4.x we run a query against a Cube very obviously, but in
>>>>>>> version 5 the cube usage is implicit, so can you advise: for a given
>>>>>>> query, which model will be used, and which index (cube) will be used
>>>>>>> for this query?
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> On Wed, Nov 1, 2023 at 1:42 PM Xiaoxiang Yu <[email protected]> wrote:
>>>>>>>
>>>>>>>> 1. How do I measure the size of the index (cube) in version 5?
>>>>>>>>    You can check storage of specific Indexes from the Index page.
>>>>>>>>
>>>>>>>> https://kylin.apache.org/5.0/docs/modeling/model_design/aggregation_group#view-aggregate-index
>>>>>>>> or
>>>>>>>> https://kylin.apache.org/5.0/assets/images/index_1-6ad3f55183d4ed61962359d9408ba192.png
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. How to create the cardinality for each column?
>>>>>>>>    You should check this link :
>>>>>>>> https://kylin.apache.org/5.0/docs/datasource/data_sampling/ .
>>>>>>>>
>>>>>>>> 3. In your default project sample named SSB project, you have only
>>>>>>>> 4 simple aggregate group index and no table index as in attached file
>>>>>>>> so what is the best strategy to select index for our OLAP?
>>>>>>>>     1. There actually is a 'Base Table Index' by default;
>>>>>>>> its id is 20000000001.
>>>>>>>>     2. I think it is a good question, and Kylin 5 lacks such a guide
>>>>>>>> for better modeling. You are free to ask your questions on the
>>>>>>>> mailing list and I will try to reply.
>>>>>>>>
>>>>>>>> ------------------------
>>>>>>>> With warm regard
>>>>>>>> Xiaoxiang Yu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 1, 2023 at 2:12 PM Xiaoxiang Yu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> OK, I didn't read all the mail history, so I misunderstood the
>>>>>>>>> situation. It looks like you need to analyse
>>>>>>>>> why the query didn't hit the cube correctly.
>>>>>>>>>
>>>>>>>>> Please generate a query diagnosis package and send it to me
>>>>>>>>> privately. I will analyse the query log.
>>>>>>>>> You can refer to the steps in the following screenshots.
>>>>>>>>>
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> If the screenshots are not displaying correctly, please read this
>>>>>>>>> guide :
>>>>>>>>>
>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/system-operation/diagnosis/#generate-query-diagnosis-package-in-web-ui
>>>>>>>>>
>>>>>>>>> By the way, you need to analyse the cause by reading
>>>>>>>>> kylin.query.log, not the kylin.log,
>>>>>>>>> refer to
>>>>>>>>> https://kylin.apache.org/5.0/docs/operations/logs/system_log
>>>>>>>>>
>>>>>>>>> ------------------------
>>>>>>>>> With warm regard
>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 1, 2023 at 12:18 PM Nam Đỗ Duy <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you Xiaoxiang for your advice. As my email title shows, I
>>>>>>>>>> guessed that the OLAP functionalities have not been correctly set up
>>>>>>>>>> on my computer.
>>>>>>>>>>
>>>>>>>>>> The evidence is this: when I disable the Pushdown option
>>>>>>>>>> box to use only the precomputed cube, it showed the following error
>>>>>>>>>> (quoted below).
>>>>>>>>>> Please kindly advise how to properly build the OLAP.
>>>>>>>>>>
>>>>>>>>>> LIMIT 500": No realization found for OLAPContext, 
>>>>>>>>>> MODEL_UNMATCHED_JOIN, 
>>>>>>>>>> rel#2240:KapTableScan.OLAP.[](table=[VNEVENT_HIVE_DWH_400MILLION_ROWS,
>>>>>>>>>>  FACTUSEREVENT],ctx=0@null,fields=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
>>>>>>>>>> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 1, 2023 at 10:40 AM Xiaoxiang Yu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>     Yesterday, I checked whether the query pushdown functions work
>>>>>>>>>>> well in the Kylin5 docker, and all of my queries returned proper
>>>>>>>>>>> responses.
>>>>>>>>>>>     After checking your logs from Shaofeng, I found these error
>>>>>>>>>>> messages repeated many times:
>>>>>>>>>>>     1. 'java.io.IOException: All datanodes
>>>>>>>>>>> DatanodeInfoWithStorage[127.0.0.1:9866,DS-5093899b-06c7-4386-95d5-6fc271d92b52,DISK]
>>>>>>>>>>> are bad. Aborting...'
>>>>>>>>>>>     2. 'curator.ConnectionState : Connection timed out for
>>>>>>>>>>> connection string (localhost:2181) and timeout (15000) / elapsed (41794)
>>>>>>>>>>> org.apache.curator.CuratorConnectionLossException:
>>>>>>>>>>> KeeperErrorCode = ConnectionLoss'
>>>>>>>>>>>
>>>>>>>>>>>     I guess the root cause is that the container did not have
>>>>>>>>>>> enough resources. I found that you queried a table called
>>>>>>>>>>> 'XXX_hive_dwh_400million_rows'; it looks like you ran a complex query
>>>>>>>>>>> on a table which contains 400 million rows?
>>>>>>>>>>>
>>>>>>>>>>>     Since I am the uploader of the Kylin5 docker image, I want to
>>>>>>>>>>> give some explanation. The Kylin5 docker is not a place for performance
>>>>>>>>>>> benchmarks; it is only for demonstration. It is allocated very
>>>>>>>>>>> few resources (8G memory) if you use the default command from the
>>>>>>>>>>> docker hub page. Before I uploaded my image, I only tested it using
>>>>>>>>>>> the ssb dataset, in which the biggest table contains only about 60k
>>>>>>>>>>> rows. If you are using a larger dataset and more complex queries, you
>>>>>>>>>>> have to scale the resources properly. By default, try querying tables
>>>>>>>>>>> which contain no more than 100k rows.
>>>>>>>>>>>
>>>>>>>>>>>     Here are some tips which may help you check whether the daemon
>>>>>>>>>>> services are healthy and resources (particularly disk space) are
>>>>>>>>>>> configured properly.
>>>>>>>>>>>
>>>>>>>>>>>     1. Check the HDFS web UI (
>>>>>>>>>>> http://localhost:9870/dfshealth.html#tab-datanode ) to confirm
>>>>>>>>>>> that the HDFS service is in 'In service' status.
>>>>>>>>>>>     2. Check the Datanode log in
>>>>>>>>>>> `/opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log` for
>>>>>>>>>>> error messages, for example:
>>>>>>>>>>> `cat /opt/hadoop-3.2.1/logs/hadoop-root-datanode-Kylin5-Machine.log | grep ERROR | wc -l`
>>>>>>>>>>>     3. Check that your docker engine is configured with enough
>>>>>>>>>>> disk space. If you are using Docker Desktop like me, please go to
>>>>>>>>>>> "Settings" - "Resources" - "Advanced" and make sure you have allocated
>>>>>>>>>>> 40GB+ disk space to the docker container.
>>>>>>>>>>>     4. Check the available disk space of your container with
>>>>>>>>>>> `df -h`, and make sure the 'Use%' of 'overlay' is less than 60%.
>>>>>>>>>>>     5. Check the load average / CPU usage / JVM GC. Make sure
>>>>>>>>>>> these metrics are not very high when you send a query.
>>>>>>>>>>> ------------------------
>>>>>>>>>>> With warm regard
>>>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 31, 2023 at 5:13 PM Nam Đỗ Duy
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi ShaoFeng
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much for your valuable feedback
>>>>>>>>>>>>
>>>>>>>>>>>> I saw that the application is there (if I see it right), as in the
>>>>>>>>>>>> attached photo. Kindly advise so that I can run this query on OLAP.
>>>>>>>>>>>>
>>>>>>>>>>>> PS. I sent you the log file in private.
>>>>>>>>>>>>
>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 31, 2023 at 3:11 PM ShaoFeng Shi <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Can you provide the messages in logs/kylin.log when executing
>>>>>>>>>>>>> the SQL? You can also check the Spark UI from the yarn resource
>>>>>>>>>>>>> manager (there should be one running application called Sparder,
>>>>>>>>>>>>> which is Kylin's backend spark application). If the application is
>>>>>>>>>>>>> not there, it may indicate that yarn doesn't have the resources to
>>>>>>>>>>>>> start it up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>>>>> Apache Kylin PMC,
>>>>>>>>>>>>> Apache Incubator PMC,
>>>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apache Kylin FAQ:
>>>>>>>>>>>>> https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>>>>>>>>> Join Kylin user mail group: [email protected]
>>>>>>>>>>>>> Join Kylin dev mail group: [email protected]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nam Đỗ Duy <[email protected]> 于2023年10月31日周二 10:35写道：
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Sir/Madam,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a fact table with 500 million rows; I then built the model
>>>>>>>>>>>>>> and index according to the website help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I chose full incremental because this is the first time I
>>>>>>>>>>>>>> load data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I created both index types, aggregate group index and table index,
>>>>>>>>>>>>>> as in the attached photo.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But the query always fails after the timeout of 300 seconds (I
>>>>>>>>>>>>>> run in docker). I don't want to increase the 300-second value
>>>>>>>>>>>>>> because I wish the OLAP could run within 1 minute (is that possible?).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems that the OLAP indexing function is not working to
>>>>>>>>>>>>>> speed up the query with the precomputed cube.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you advise how to check whether the index really worked?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is quite an urgent task for me, so a prompt response is highly
>>>>>>>>>>>>>> appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you very much
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
