Re: doubt about measure of processedRowCount

ShaoFeng Shi Tue, 06 Nov 2018 17:29:44 -0800

Good job, JiaTao! I appreciate your support for the community!

JiaTao Tao <taojia...@gmail.com> wrote on Wed, Nov 7, 2018 at 9:17 AM:


> Very glad that my reply was helpful. I have already opened a JIRA to add
> logging to "*GTStreamAggregateScanner*", so next time this should be much
> easier to track down :).
>
> cheney <531014...@qq.com> wrote on Tue, Nov 6, 2018 at 11:57 PM:
>
>> Hi JiaTao, thank you very much! The statistics are correct when I set
>> "kylin.query.stream-aggregate-enabled=false".
>> You are right: records are pre-aggregated by GTStreamAggregateScanner.
>>
>>
>> ------------------ Original Message ------------------
>> *From:* "JiaTao Tao"<taojia...@gmail.com>;
>> *Sent:* Tue, Nov 6, 2018 10:50 PM
>> *To:* "user"<u...@kylin.apache.org>;
>> *Subject:* Re: doubt about measure of processedRowCount
>>
>> One possible place I can find in the code is
>> *GTStreamAggregateScanner* (used in "*SegmentCubeTupleIterator.java#111*").
>> It does aggregate records in
>> "*GTStreamAggregateScanner.AbstractStreamMergeIterator#next*", so it will
>> reduce the number of input rows. But there is no logging in this class, as
>> you can see, so it's pretty hard to confirm. Try
>> "kylin.query.stream-aggregate-enabled=false" and run the scenario again to
>> see whether anything changes.
>>
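To illustrate the point above, here is a minimal sketch of why a stream-aggregating iterator can emit fewer rows than it receives. It is a Python illustration of the general technique (merging consecutive rows that share a key in a sorted stream), not Kylin's Java implementation; the function name `stream_aggregate` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def stream_aggregate(sorted_rows):
    """Merge consecutive rows that share a key by summing their measures.

    Assumes the rows arrive already sorted by key, which is what makes a
    single streaming pass sufficient.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Hypothetical (dimension key, measure) rows, sorted by key.
rows_from_storage = [("a", 1), ("a", 2), ("b", 5), ("c", 1), ("c", 1), ("c", 4)]
merged = list(stream_aggregate(rows_from_storage))

# Fewer rows come out than went in, so a counter placed after this step
# would be smaller than one placed before it.
print(len(rows_from_storage), "rows in,", len(merged), "rows out")
```

If a row counter sits downstream of a step like this, it would report the merged count rather than the count returned by storage, which would be consistent with the discrepancy discussed in this thread.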
>> cheney <531014...@qq.com> wrote on Mon, Nov 5, 2018 at 6:55 PM:
>>
>>> Yes. The log is as follows.
>>>
>>> 2018-11-02 22:25:34,980 DEBUG [Query
>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914]
>>> gtrecord.StorageResponseGTScatter:88 : Using
>>> SortMergedPartitionResultIterator to merge 103 partition results
>>> 2018-11-02 22:25:34,982 INFO  [Query
>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914]
>>> gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat *to
>>> merge segment results*
>>> 2018-11-02 22:25:34,982 DEBUG [Query
>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122
>>> : return TupleIterator...
>>> 2018-11-02 22:25:34,991 INFO  [Query
>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 :
>>> *Processed rows for each storageContext*: 366
>>> 2018-11-02 22:25:34,991 INFO  [Query
>>> 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 :
>>> Stats of SQL response: isException: false, duration: 20, *total scan
>>> count 1552*
>>>
>>> According to the log, *valueA* = 366. *valueB* = (total scan count) 1552 -
>>> (total aggregated/filtered in HBase) 270 = 1282.
>>> *valueB* is much larger than *valueA*.
>>>
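The arithmetic in the message above can be written out as a small Python sketch. It uses only the totals quoted in this thread (1552, 270 and 366); the helper name `rows_returned` is hypothetical, and per-region numbers were not posted, so the totals are treated as a single aggregate.

```python
def rows_returned(total_scanned, total_filtered_or_aggregated):
    """Rows handed back by storage: scanned rows minus rows filtered/aggregated away."""
    return total_scanned - total_filtered_or_aggregated


# Totals quoted in the thread.
value_b = rows_returned(1552, 270)
value_a = 366  # "Processed rows for each storageContext" from the log

# The gap is the number of rows unaccounted for between storage and the counter.
gap = value_b - value_a
print(value_b, gap)
```

The gap of 916 rows is what needs explaining: some step between the HBase response and the point where processedRowCount is incremented must be dropping or merging rows.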
>>>
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "JiaTao Tao"<taojia...@gmail.com>;
>>> *Sent:* Mon, Nov 5, 2018 2:41 PM
>>> *To:* "user"<u...@kylin.apache.org>;
>>> *Subject:* Re: doubt about measure of processedRowCount
>>>
>>> Can you grep logs like "to merge segment results" in that scenario?
>>>
>>> cheney <531014...@qq.com> wrote on Sat, Nov 3, 2018 at 4:15 PM:
>>>
>>>> Thank you for replying, but I am sure there is only one OlapContext in
>>>> the query in my scenario.
>>>> ---Original---
>>>> *From:* "JiaTao Tao"<taojia...@gmail.com>
>>>> *Date:* Sat, Nov 3, 2018 10:42 AM
>>>> *To:* "user"<u...@kylin.apache.org>;
>>>> *Subject:* Re: doubt about measure of processedRowCount
>>>>
>>>> Maybe summing all the *valueA* entries would be more appropriate, because
>>>> there may be more than one OlapContext in the query (one OlapContext
>>>> corresponds to one storageContext).
>>>>
>>>> There are two good blog posts about Kylin's query engine that you may
>>>> want to take a look at :).
>>>>
>>>> https://blog.csdn.net/yu616568/article/details/50838504
>>>>
>>>> https://zhuanlan.zhihu.com/p/30613434
>>>>
>>>> cheney <531014...@qq.com> wrote on Fri, Nov 2, 2018 at 11:10 PM:
>>>>
>>>>> Hi, guys
>>>>>
>>>>> When I execute a SQL query in Kylin, the Kylin server logs some query
>>>>> statistics. For example:
>>>>>
>>>>>        "Processed rows for each storageContext: *valueA*", where *valueA*
>>>>> is processedRowCount.
>>>>>
>>>>> My understanding is that processedRowCount is the number of rows
>>>>> returned by HBase.
>>>>>
>>>>> The HBase coprocessor logs region stats, including "*Total scanned
>>>>> row*" and "Total filtered/aggred row".
>>>>>
>>>>> For one region, the number of records finally returned by HBase =
>>>>> *Total scanned row* - Total filtered/aggred row.
>>>>> Suppose this query needs to scan 10 regions in HBase. We can get the
>>>>> stats for every region, so we can get the total number of records,
>>>>> *valueB*, returned by HBase by summing the final record counts of all
>>>>> 10 regions.
>>>>>
>>>>> In general, *valueA* is equal to *valueB*, but sometimes *valueB* is
>>>>> much larger than *valueA*. Why?
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> Regards!
>>>>
>>>> Aron Tao
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> Regards!
>>>
>>> Aron Tao
>>>
>>
>>
>> --
>>
>>
>> Regards!
>>
>> Aron Tao
>>
>
>
> --
>
>
> Regards!
>
> Aron Tao
>


-- 
Best regards,

Shaofeng Shi 史少锋
