Hi Lionel,
Thank you — griffin-dsl works for this use case. We are awaiting the next release to cover another of our use cases: using more than one table, joined via a SQL query, as the source, versus a multi-sliced set of tables joined via another SQL query as the target, and running an accuracy check between the two.
Regards,
Ananthanarayanan.M.S

On Thu, Nov 30, 2017 at 7:42 PM, Lionel Liu <bhlx3l...@163.com> wrote:
> Hi Ananthanarayanan.M.S,
> Thanks for your reply. I think you're right: in this version, griffin-dsl
> can handle accuracy and profiling cases, but most users prefer SQL.
> Actually, in this version, to accommodate streaming mode, we added some
> specific processing to the Spark SQL path; it can handle some profiling
> cases but fails in others. We'll fix this bug and try to pass SQL directly
> to the measure engine, so you'll be able to use SQL in the next version.
>
> --
> Regards,
> Lionel, Liu
>
> At 2017-11-29 12:07:39, "Ananthanarayanan Ms"
> <ananthanarayanan.ms.ap@nielsen.com> wrote:
>
> Hi Lionel,
> Sure, I will try griffin-dsl and update here. I had wanted to keep a
> distance from the griffin-dsl feature, as it has a learning curve around
> its expressions and attributes. We are planning for end users who would
> not have this background, so we would need an intermediate step to
> translate their needs into this DSL, unlike SQL, in which they are
> already proficient.
>
> Thank you.
>
> Regards,
> Ananthanarayanan.M.S
>
> On Tue, Nov 28, 2017 at 8:38 AM, Lionel Liu <bhlx3l...@163.com> wrote:
>
>> Hi Ananthanarayanan.M.S,
>> Sorry for the late reply. I think you are trying the accuracy measure by
>> configuring the field "dsl.type" as "spark-sql", not as "griffin-dsl",
>> right?
>> If you just want accuracy, you'd better try "griffin-dsl" and configure
>> "dq.type" as "accuracy", with a "rule" like this: "source.name =
>> target.name and source.age = target.age".
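For readers following the thread: what an "accuracy" rule such as the one above measures can be sketched in plain Python. The rows below are purely illustrative; Griffin itself evaluates this on Spark.

```python
# Sketch of what an accuracy rule like
# "source.name = target.name and source.age = target.age" measures:
# the fraction of source rows that find a target row agreeing on all
# compared fields. Toy in-memory data, not Griffin's actual engine.

source = [
    {"name": "alice", "age": 30},
    {"name": "bob",   "age": 25},
    {"name": "carol", "age": 41},
]
target = [
    {"name": "alice", "age": 30},
    {"name": "bob",   "age": 26},  # age differs, so this row won't match
]

def accuracy(source_rows, target_rows, keys):
    """Return (matched, total, ratio) for the given compared fields."""
    target_keys = {tuple(row[k] for k in keys) for row in target_rows}
    matched = sum(
        1 for row in source_rows
        if tuple(row[k] for k in keys) in target_keys
    )
    total = len(source_rows)
    return matched, total, matched / total

matched, total, ratio = accuracy(source, target, ["name", "age"])
print(matched, total, ratio)  # only "alice" matches: 1 of 3
```

The DQ result is the simple ratio (matched/total) that Lionel describes later in the thread as the preferred shape for a metric.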
>> For the "spark-sql" type, we add a column "__tmst" into the SQL and group
>> by it, to fit some streaming cases, but this does need some background
>> knowledge. For example, in an accuracy case with multiple data sources,
>> the SQL engine cannot tell whether the column "__tmst" belongs to the
>> source or the target, so the reference is ambiguous. We'll fix this bug
>> in a later version.
>> After all, for accuracy cases, we recommend a rule configuration like
>> this:
>> {
>>   "dsl.type": "griffin-dsl",
>>   "dq.type": "accuracy",
>>   "rule": "source.name = target.name and source.age = target.age"
>> }
>>
>> Hope this helps, thanks a lot.
>>
>> --
>> Regards,
>> Lionel, Liu
>>
>> At 2017-11-21 20:42:45, "Ananthanarayanan Ms" <
>> ananthanarayanan.ms...@nielsen.com> wrote:
>>
>> Hi Lionel,
>> Even after the change, the issue remains. It looks like, during the join,
>> both src and tgt have already been populated via withColumn with
>> "__tmst", hence '__tmst' is ambiguous here. Could you please help to
>> check and clarify?
>>
>> 17/11/21 08:38:46 ERROR engine.SparkSqlEngine: run spark sql [ SELECT
>> `__tmst` AS `__tmst`, source.product_id, source.market_id,
>> source.period_id, source.s5_volume_rom_distrib FROM source LEFT JOIN
>> target ON coalesce(source.period_id, 'null') =
>> coalesce(target.period_id, 'null') and coalesce(source.market_id, 'null')
>> = coalesce(target.market_id, 'null') and coalesce(source.product_id,
>> 'null') = coalesce(target.product_id, 'null') WHERE (NOT
>> (source.product_id IS NULL AND source.market_id IS NULL AND
>> source.period_id IS NULL)) AND target.product_id IS NULL AND
>> target.market_id IS NULL AND target.period_id IS NULL GROUP BY `__tmst` ]
>> error: Reference '__tmst' is ambiguous, could be: __tmst#145L,
>> __tmst#146L.; line 1 pos 566
>>
>> Regards,
>> Ananthanarayanan.M.S
>>
>> On Tue, Nov 21, 2017 at 8:05 AM, Lionel Liu <bhlx3l...@163.com> wrote:
>>
>>> Hi Ananthanarayanan.M.S,
>>>
>>> Actually, we add a new column `__tmst` to each table, to persist the
>>> timestamp of each data row.
>>> But the SQL you list doesn't look like our exact logic, so I'd like to
>>> see your accuracy rule statement to check.
>>>
>>> If your accuracy rule statement is like:
>>> "source.product_id = target.product_id, source.market_id =
>>> target.market_id, source.period_id = target.period_id",
>>> the SQL for that step should be:
>>> "SELECT source.* FROM source LEFT JOIN target ON
>>> coalesce(source.product_id, 'null') = coalesce(target.product_id,
>>> 'null') and coalesce(source.market_id, 'null') =
>>> coalesce(target.market_id, 'null') and coalesce(source.period_id,
>>> 'null') = coalesce(target.period_id, 'null') WHERE (NOT
>>> (source.product_id IS NULL AND source.market_id IS NULL AND
>>> source.period_id IS NULL)) AND target.product_id IS NULL AND
>>> target.market_id IS NULL AND target.period_id IS NULL".
>>>
>>> Hope this helps, thanks.
>>>
>>> --
>>> Regards,
>>> Lionel, Liu
>>>
>>> At 2017-11-21 06:50:09, "William Guo" <gu...@apache.org> wrote:
>>> >hello Ananthanarayanan.M.S,
>>> >
>>> >We support some partitioned tables. Could you show us your partitioned
>>> >table description to check? Some sample data would be helpful as well.
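The null-safe miss-record join Lionel describes can be reproduced on a toy dataset with sqlite3 (table and column names follow the thread; the data is invented for illustration — Griffin runs the equivalent on Spark SQL):

```python
# Sketch of the miss-record join from the thread, run on sqlite3 toy tables.
# coalesce(key, 'null') makes the equi-join null-safe; the WHERE clause keeps
# source rows that found no target match (all target join keys NULL).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (product_id TEXT, market_id TEXT, period_id TEXT);
    CREATE TABLE target (product_id TEXT, market_id TEXT, period_id TEXT);
    INSERT INTO source VALUES
        ('p1', 'm1', 't1'), ('p2', 'm1', 't1'), ('p3', NULL, 't2');
    INSERT INTO target VALUES
        ('p1', 'm1', 't1'), ('p3', NULL, 't2');
""")

misses = conn.execute("""
    SELECT source.* FROM source LEFT JOIN target
      ON  coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null')
      AND coalesce(source.market_id,  'null') = coalesce(target.market_id,  'null')
      AND coalesce(source.period_id,  'null') = coalesce(target.period_id,  'null')
    WHERE NOT (source.product_id IS NULL AND source.market_id IS NULL
               AND source.period_id IS NULL)
      AND target.product_id IS NULL AND target.market_id IS NULL
      AND target.period_id IS NULL
""").fetchall()

print(misses)  # [('p2', 'm1', 't1')] -- the one source row absent from target
```

Note how the 'p3' row matches even though its market_id is NULL on both sides: coalesce maps both NULLs to the sentinel string 'null', which is exactly why Griffin's generated SQL uses it.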
>>> >
>>> >Thanks,
>>> >William
>>> >________________________________
>>> >From: Ananthanarayanan Ms <ananthanarayanan.ms...@nielsen.com>
>>> >Sent: Tuesday, November 21, 2017 3:53 AM
>>> >To: William Guo
>>> >Cc: dev@griffin.incubator.apache.org; Abishek Kunduru
>>> >Subject: Re: Re: Griffin Accuracy Doubts
>>> >
>>> >Hello William / Lionel,
>>> >Could you please let us know if we can use 1.6? Could you please also
>>> >let us know whether partitioned tables alone can be used as src/tgt, as
>>> >the error below occurs when we build and run the master code.
>>> >
>>> >17/11/20 14:51:13 ERROR engine.SparkSqlEngine: run spark sql [ SELECT
>>> >`__tmst` AS `__tmst`, source.product_id, source.market_id,
>>> >source.period_id, s5_volume_rom_distrib FROM source LEFT JOIN target ON
>>> >coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null')
>>> >and coalesce(source.market_id, 'null') = coalesce(target.market_id,
>>> >'null') and coalesce(source.product_id, 'null') =
>>> >coalesce(target.product_id, 'null') WHERE (NOT (source.product_id IS
>>> >NULL)) GROUP BY `__tmst` ] error: Reference '__tmst' is ambiguous,
>>> >could be: __tmst#145L, __tmst#146L.; line 1 pos 413
>>> >
>>> >Regards,
>>> >Ananthanarayanan.M.S
>>> >
>>> >On Wed, Oct 4, 2017 at 2:39 PM, William Guo
>>> ><gu...@apache.org<mailto:gu...@apache.org>> wrote:
>>> >
>>> >hi Abishek,
>>> >
>>> >For SQL support, the version should be 0.1.6; we will release it in the
>>> >middle of October.
>>> >
>>> >Thanks,
>>> >
>>> >William
>>> >
>>> >________________________________
>>> >From: Abishek Kunduru
>>> ><abishek.kund...@nielsen.com<mailto:abishek.kund...@nielsen.com>>
>>> >Sent: Tuesday, October 3, 2017 12:35 AM
>>> >To: 刘力力
>>> >Cc:
>>> >dev@griffin.incubator.apache.org<mailto:dev@griffin.incubator.apache.org>;
>>> >Vidyaranya Malladi; gu...@apache.org<mailto:gu...@apache.org>; Mara
>>> >Preotescu; Ananthanarayanan Ms
>>> >Subject: Re: Re: Griffin Accuracy Doubts
>>> >
>>> >Thanks, William and Lionel, for the quick response. Could you please
>>> >tell us when Griffin with SQL support will be released, and its version
>>> >number?
>>> >
>>> >Regarding the second question, Vidya is not referring to changing the
>>> >source or target data, but rather to a DQ feature that compares source
>>> >and target columns and outputs the difference as a percentage change
>>> >(Vidya also gave an example in the table below, which shows the
>>> >difference between source and target numeric columns).
>>> >
>>> >Best,
>>> >Abishek
>>> >
>>> >On Fri, Sep 29, 2017 at 10:13 PM, 刘力力
>>> ><bhlx3l...@163.com<mailto:bhlx3l...@163.com>> wrote:
>>> >Hi Vidyaranya,
>>> >
>>> >I have some more comments on the answers from William. Please check the
>>> >comments below.
>>> >
>>> >1. Does Griffin plan to support multiple tables as src, e.g. a src
>>> >table built from a query joining tabl1, tabl2, ..., tabln with a few
>>> >selected columns, finally used as src, and similarly for the
>>> >destination?
>>> >
>>> >--Yes, since we have changed our backend to fully support SQL.
>>> >
>>> >2. Would Griffin support the result metric column-wise instead of an
>>> >overall matched count?
>>> >E.g. if the src table has 4 columns and the dest has 4 or more, and the
>>> >user wants to check the accuracy of col4, which is 'price', in src and
>>> >destination, by adding a new column '%change' to the src table,
>>> >computed as abs(src_price - dest_price)/100.
>>> >
>>> >--Yes, since we have changed our backend to fully support SQL.
>>> >
>>> >Fact_tag                                  fact_sDesc        fat_Ldesc                SRC Fact_precision  SRC Data  TGT Data  % Chg
>>> >f000000000000000000100000000000000000000  Units             Sales Units              1                   25510.00  2551.00   0.00
>>> >f000000000000000000100000000000000000001  Units (any promo) Sales Units (any promo)  1                   5846.00   584.59    0.00
>>> >f000000000000000000100000000000000000002  Units (no promo)  Sales Units (no promo)   1                   17172.00  1717.25   0.00
>>> >
>>> >--[comment from Lionel] Actually, I think this modification of the data
>>> >source is more like an ETL task than the DQ domain. We consider that
>>> >the DQ result from Griffin should be simple, like a value or a
>>> >percentage or a group of them; that might be better for a DQ check.
>>> >
>>> >3. Support UDFs on the columns before comparing; e.g. one of our
>>> >sources is a table in mainframe format where dots/decimals are dropped,
>>> >and we have to apply logic to recover the decimals.
>>> >
>>> >--For UDFs, if the UDF is registered in your computing cluster, it
>>> >should work from the command line. We are considering refactoring the
>>> >UI to support a SQL editor directly, but we are still considering how
>>> >to provide a better front-end user experience for UDFs.
>>> >
>>> >4. Last time we discussed a few features that were said to be under
>>> >discussion; could you please let us know if there is a roadmap for
>>> >them, like advanced SQL features, adding a group-by clause, UDFs, etc.?
>>> >
>>> >--SQL support is already done; we will have a quick release after it is
>>> >fully tested and reviewed.
>>> >
>>> >5.
Griffin compares the columns selected by using them in the joins. Do we
>>> >plan to enhance this so that the user can select a few columns other
>>> >than those mentioned in the joins, and we compare the src column's
>>> >values against the dest column's values, say without the join and only
>>> >by an order-by clause?
>>> >
>>> >select col from src order by prikey
>>> >select col from tgt order by prikey
>>> >
>>> >i.e. src.col vs tgt.col instead of the join key. Please let us know.
>>> >
>>> >--For this case, it depends on how you would like to map src.col to
>>> >tgt.col. We don't need a join key if you can elaborate the mapping in
>>> >SQL, since we are fully supporting SQL.
>>> >
>>> >--[comment from Lionel] We support SQL directly; if you can describe
>>> >your case in SQL, we can do that.
>>> >
>>> >Thanks very much.
>>> >
>>> >--
>>> >Regards,
>>> >Lionel, Liu
>>> >
>>> >At 2017-09-30 08:23:39, "William Guo"
>>> ><guo...@outlook.com<mailto:guo...@outlook.com>> wrote:
>>> >>hi Vidyaranya,
>>> >>
>>> >>Please check my comments inline.
>>> >>
>>> >>Thanks,
>>> >>
>>> >>William
>>> >>
>>> >>________________________________
>>> >>From: Vidyaranya Malladi
>>> >><vidyaranya.malladi...@nielsen.com<mailto:vidyaranya.malladi...@nielsen.com>>
>>> >>Sent: Friday, September 29, 2017 10:06 PM
>>> >>To: gu...@apache.org<mailto:gu...@apache.org>;
>>> >>guo...@outlook.com<mailto:guo...@outlook.com>
>>> >>Cc:
>>> >>dev@griffin.incubator.apache.org<mailto:dev@griffin.incubator.apache.org>;
>>> >>Mara Preotescu; Abishek Kunduru; Ananthanarayanan Ms
>>> >>Subject: Griffin Accuracy Doubts
>>> >>
>>> >>Dear William,
>>> >> Good day. Please let us know about the questions below.
>>> >>
>>> >>[...]
>>> >>
>>> >>Thank you very much.
>>> >>
>>> >>Thanks & Regards
>>> >>M. Vidyaranya
>>> >>NDX
>>> >>TCS - Nielsen Relationship
>>> >>Mobile : +91 9790591311
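
For reference, the positional comparison asked about in question 5 can be sketched in Python. The key names and data below are invented for illustration, and the "% change" formula follows question 2's abs(src - tgt)/100; in SQL one would typically pair rows via a row number derived from the ORDER BY.

```python
# Sketch of question 5: order both sides by the same primary key, then
# compare src.col with tgt.col row by row, with no join on the compared
# column itself. Toy data only.

src = [("k2", 10.0), ("k1", 25510.0), ("k3", 17172.0)]  # (prikey, col)
tgt = [("k1", 2551.0), ("k3", 1717.25), ("k2", 10.0)]

src_sorted = [v for _, v in sorted(src)]  # SELECT col FROM src ORDER BY prikey
tgt_sorted = [v for _, v in sorted(tgt)]  # SELECT col FROM tgt ORDER BY prikey

# Per-row percentage-change column between the positionally paired values,
# per question 2's formula abs(src_price - dest_price) / 100.
pct_change = [abs(s - t) / 100 for s, t in zip(src_sorted, tgt_sorted)]
print(pct_change)
```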