Hi Ananthanarayanan.M.S,
  Thanks for your reply. I think you're right: in this version, griffin-dsl 
can handle accuracy and profiling cases, but most users prefer SQL. 
  Actually, in this version, with streaming mode in mind, we added some 
specific processing for Spark SQL; it can handle some profiling cases but 
fails in others. We'll fix this bug and try to pass SQL directly to the 
measure engine, so you'll be able to use SQL in the next version.




--

Regards,
Lionel, Liu

At 2017-11-29 12:07:39, "Ananthanarayanan Ms" 
<ananthanarayanan.ms...@nielsen.com> wrote:

Hi Lionel,
 Sure, will try griffin-dsl and update here. I was keeping my distance from 
griffin-dsl because it has a learning curve around its expressions and 
attributes, and we are planning for end users who would not have this 
background; without it we would need an intermediate step to translate their 
needs into the DSL, unlike SQL, in which they are already proficient.


Thank you.




Regards,
Ananthanarayanan.M.S


On Tue, Nov 28, 2017 at 8:38 AM, Lionel Liu <bhlx3l...@163.com> wrote:

Hi Ananthanarayanan.M.S,
Sorry for the late reply. I think you are running the accuracy measure with 
the field "dsl.type" configured as "spark-sql" rather than "griffin-dsl", 
right?
If you just want accuracy, you'd better try "griffin-dsl": configure 
"dq.type" as "accuracy", with a "rule" like "source.name = target.name and 
source.age = target.age".
For the "spark-sql" type, we add a column "__tmst" into the SQL and group by 
it, to fit some streaming cases, but it does need some background knowledge. 
For example, in the accuracy case, when there are multiple data sources, the 
SQL engine cannot tell whether the column "__tmst" belongs to source or 
target, so the reference is ambiguous. We'll fix this bug in a later version.
For accuracy cases, we therefore recommend a rule configuration like this:
{
  "dsl.type": "griffin-dsl",
  "dq.type": "accuracy",
  "rule": "source.name = target.name and source.age = target.age"
}




Hope this can help you, thanks a lot.




--

Regards,
Lionel, Liu

At 2017-11-21 20:42:45, "Ananthanarayanan Ms" 
<ananthanarayanan.ms...@nielsen.com> wrote:

Hi Lionel,
 Even after the change, the issue remains. It looks like, during the join, 
both src and tgt have already been populated (via withColumn) with __tmst, 
hence '__tmst' is ambiguous here. Could you please help check and clarify?


17/11/21 08:38:46 ERROR engine.SparkSqlEngine: run spark sql [ SELECT `__tmst` 
AS `__tmst`, source.product_id, source.market_id, 
source.period_id,source.s5_volume_rom_distrib FROM source LEFT JOIN target ON 
coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null') and 
coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null') and 
coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null') WHERE 
(NOT (source.product_id IS NULL AND source.market_id IS NULL AND source. 
period_id IS NULL)) AND target.product_id IS NULL AND target.market_id IS NULL 
AND target.period_id IS NULL GROUP BY `__tmst` ] error: Reference '__tmst' is 
ambiguous, could be: __tmst#145L, __tmst#146L.; line 1 pos 566
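For context, both source and target carry the injected `__tmst` column at that step, so the unqualified reference in the SELECT and GROUP BY cannot be resolved. If the generated SQL could be edited by hand, qualifying the column would be expected to remove the ambiguity; this is only a sketch of the failing query with that change applied, not the engine's actual fix:

```sql
-- Sketch: qualify __tmst with its table so the reference is unambiguous.
SELECT source.`__tmst` AS `__tmst`,
       source.product_id, source.market_id,
       source.period_id, source.s5_volume_rom_distrib
FROM source LEFT JOIN target
  ON coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null')
 AND coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null')
 AND coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null')
WHERE (NOT (source.product_id IS NULL AND source.market_id IS NULL
            AND source.period_id IS NULL))
  AND target.product_id IS NULL AND target.market_id IS NULL
  AND target.period_id IS NULL
GROUP BY source.`__tmst`
```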
 




Regards,
Ananthanarayanan.M.S


On Tue, Nov 21, 2017 at 8:05 AM, Lionel Liu <bhlx3l...@163.com> wrote:

Hi Ananthanarayanan.M.S,


Actually, we add a new column `__tmst` to each table, to persist the 
timestamp of each data row.
But the SQL you listed doesn't look like our exact logic, so I'd like to see 
your accuracy rule statement to check.


If your accuracy rule statement is like:
"source.product_id = target.product_id, source.market_id = target.market_id, 
source.period_id = target.period_id",
the sql for that step should be:
"SELECT source.* FROM source LEFT JOIN target ON coalesce(source.product_id, 
'null') = coalesce(target.product_id, 'null') and coalesce(source.market_id, 
'null') = coalesce(target.market_id, 'null') and coalesce(source.period_id, 
'null') = coalesce(target.period_id, 'null') WHERE (NOT (source.product_id IS 
NULL AND source.market_id IS NULL AND source.period_id IS NULL)) AND 
target.product_id IS NULL AND target.market_id IS NULL AND target.period_id 
IS NULL".
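The coalesce wrapping in the join condition is a null-safe equality trick: in SQL, `NULL = NULL` evaluates to unknown rather than true, so rows whose key is NULL on both sides would otherwise always count as mismatched. A minimal illustration (it assumes the literal string 'null' never occurs as real key data):

```sql
-- NULL = NULL is unknown (not true), so a plain equality join drops
-- rows whose key is NULL on both sides:
--   ... ON source.market_id = target.market_id
-- Mapping NULL to a sentinel makes two NULL keys compare equal:
--   ... ON coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null')
SELECT NULL = NULL;                                      -- NULL (unknown)
SELECT coalesce(NULL, 'null') = coalesce(NULL, 'null');  -- true
```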


Hope this can help you, thanks.



--

Regards,
Lionel, Liu



At 2017-11-21 06:50:09, "William Guo" <gu...@apache.org> wrote:
>hello Ananthanarayanan.M.S,
>
>We support some partitioned tables. Could you show us your partitioned table 
>description for checking? Some sample data would also be helpful.
>
>Thanks,
>William
>________________________________
>From: Ananthanarayanan Ms <ananthanarayanan.ms...@nielsen.com>
>Sent: Tuesday, November 21, 2017 3:53 AM
>To: William Guo
>Cc: dev@griffin.incubator.apache.org; Abishek Kunduru
>Subject: Re: Re: Griffin Accuracy Doubts
>
>Hello William / Lionel,
>Could you please let us know if we can use 1.6? Also, please let us know 
>whether partitioned tables alone can be used as src/tgt, as the error below 
>occurs when we build and run the master code.
>
>17/11/20 14:51:13 ERROR engine.SparkSqlEngine: run spark sql [ SELECT `__tmst` 
>AS `__tmst`, source.product_id, source.market_id, 
>source.period_id,s5_volume_rom_distrib FROM source LEFT JOIN target ON 
>coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null') and 
>coalesce(source.market_id, 'null') = coalesce(target.market_id, 'null') and 
>coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null') 
>WHERE (NOT (source.product_id IS NULL )) GROUP BY `__tmst` ] error: Reference 
>'__tmst' is ambiguous, could be: __tmst#145L, __tmst#146L.; line 1 pos 413
>
>
>
>Regards,
>Ananthanarayanan.M.S
>
>On Wed, Oct 4, 2017 at 2:39 PM, William Guo 
><gu...@apache.org<mailto:gu...@apache.org>> wrote:
>
>hi Abishek,
>
>
>For SQL support, the version should be 0.1.6; we will release it in the 
>middle of October.
>
>
>
>Thanks,
>
>William
>
>
>________________________________
>From: Abishek Kunduru 
><abishek.kund...@nielsen.com<mailto:abishek.kund...@nielsen.com>>
>Sent: Tuesday, October 3, 2017 12:35 AM
>To: 刘力力
>Cc: dev@griffin.incubator.apache.org<mailto:dev@griffin.incubator.apache.org>; 
>Vidyaranya Malladi; gu...@apache.org<mailto:gu...@apache.org>; Mara Preotescu; 
>Ananthanarayanan Ms
>Subject: Re: Re: Griffin Accuracy Doubts
>
>Thanks William and Lionel for the quick response. Can you please tell us when 
>the Griffin release with SQL support will be out, and its version number?
>
>Regarding the second question, Vidya is not referring to changing the source 
>or target data, but to a DQ feature that compares source and target columns 
>and outputs the difference as a percentage change (Vidya gave an example in 
>the table below, which shows the difference between source and target numeric 
>columns).
>
>Best,
>Abishek
>
>
>On Fri, Sep 29, 2017 at 10:13 PM, 刘力力 
><bhlx3l...@163.com<mailto:bhlx3l...@163.com>> wrote:
>Hi Vidyaranya,
>
>I have some more comments on the answers from William. Please check the 
>following comments.
>
>
>1. Does Griffin plan to support multiple tables as src, e.g. a src table 
>built from a query joining tabl1, tabl2, ..., tabln, selecting a few cols, 
>and then using that as src, and similarly for the destination?
>
>--Yes, since we have changed our backend to fully support SQL.
>
>2. Would Griffin support a column-wise result metric instead of an overall 
>matched count?
>e.g. if the src table has 4 cols and the dest has 4 or more cols, and the 
>user wants to check the accuracy of col4, which is 'price', in src and 
>destination, by adding a new col to the src table, '%change', which would be 
>abs(src_price - dest_price)/100.
>
>--Yes, since we have changed our backend to fully support SQL.
>
>Fact_tag | fact_sDesc | fat_Ldesc | SRC Fact_precision | SRC Data | TGT Data | % Chg
>f000000000000000000100000000000000000000 | Units | Sales Units | 1 | 25510.00 | 2551.00 | 0.00
>f000000000000000000100000000000000000001 | Units (any promo) | Sales Units (any promo) | 1 | 5846.00 | 584.59 | 0.00
>f000000000000000000100000000000000000002 | Units (no promo) | Sales Units (no promo) | 1 | 17172.00 | 1717.25 | 0.00
>
>
>--[comment from Lionel] Actually, I think this modification of the data 
>source is more like an ETL task than the DQ domain. We consider that the DQ 
>result from Griffin should be simple, like a value, a percentage, or a group 
>of them; that might be better for a DQ check.
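For the record, once plain SQL is supported, the '%change' column described in question 2 could be expressed directly in the rule SQL. The table names, join key, and column names below are illustrative, and the formula is the one given in the question:

```sql
-- Illustrative sketch: per-row percentage-change column between src and tgt,
-- using the formula from question 2 (abs(src_price - dest_price)/100).
SELECT s.fact_tag,
       s.price                      AS src_price,
       t.price                      AS tgt_price,
       abs(s.price - t.price) / 100 AS pct_change
FROM src s
JOIN tgt t ON s.fact_tag = t.fact_tag
```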
>
>
>3. Support UDFs on the cols before comparing; e.g. one of our srcs is a 
>table in a mainframe format where dots/decimals are ignored, and we have to 
>apply logic to recover the decimals.
>
>--For UDFs, if one is registered in your computing cluster, it should work 
>from the command line. We are considering refactoring the UI to support a 
>SQL editor directly, but we are still considering how to provide a better 
>user experience for UDFs from the front end.
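As a sketch of the command-line route mentioned above, a Hive-style temporary function could be registered and then applied in the comparison. The function name and implementing class here are hypothetical, purely for illustration:

```sql
-- Hypothetical: a UDF that re-inserts the implied decimal point of the
-- mainframe-formatted column before comparing it with the target.
CREATE TEMPORARY FUNCTION to_decimal AS 'com.example.udf.ToDecimal';

SELECT count(*) AS mismatched
FROM src s JOIN tgt t ON s.prikey = t.prikey
WHERE to_decimal(s.price_raw) <> t.price
```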
>
>4. Last time, a few features were said to be under discussion; could you 
>please let us know if there is a road map for them, like advanced SQL 
>features, adding a group-by clause, UDFs, etc.?
>
>--SQL support is already done; we will have a quick release after it is 
>fully tested and reviewed.
>
>5. Griffin compares the cols selected by using them in joins. Do we plan to 
>let the user select a few cols other than those mentioned in the joins, so 
>we can compare the src col's value vs the dest col's value, say without the 
>join and only by an order-by clause?
>
>
>select col from src order by prikey
>select col from tgt order by prikey
>
>src.col vs tgt.col instead of the join key. Please let us know.
>
>--For this case, it depends on how you would like to map src.col to tgt.col. 
>We don't need a join key if you can elaborate the mapping in SQL, since we 
>are fully supporting SQL.
>
>--[comment from Lionel] We support SQL directly; if you can describe your 
>case in SQL, we can do that.
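One way to express the order-based mapping from question 5 in SQL, without relying on a natural join key, is to synthesize a key from the row position. A sketch using a window function (it assumes the engine supports row_number(); names are taken from the question):

```sql
-- Align rows by their ORDER BY position instead of a natural join key.
WITH s AS (SELECT col, row_number() OVER (ORDER BY prikey) AS rn FROM src),
     t AS (SELECT col, row_number() OVER (ORDER BY prikey) AS rn FROM tgt)
SELECT s.col AS src_col, t.col AS tgt_col
FROM s JOIN t ON s.rn = t.rn
WHERE s.col <> t.col
```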
>
>Thanks very much.
>
>
>--
>Regards,
>Lionel, Liu
>
>
>At 2017-09-30 08:23:39, "William Guo" 
><guo...@outlook.com<mailto:guo...@outlook.com>> wrote:
>>hi Vidyaranya,
>>
>>
>>Please check my comments inline.
>>
>>
>>Thanks,
>>
>>William
>>
>>
>>________________________________
>>From: Vidyaranya Malladi 
>><vidyaranya.malladi...@nielsen.com<mailto:vidyaranya.malladi...@nielsen.com>>
>>Sent: Friday, September 29, 2017 10:06 PM
>>To: gu...@apache.org<mailto:gu...@apache.org>; 
>>guo...@outlook.com<mailto:guo...@outlook.com>
>>Cc: 
>>dev@griffin.incubator.apache.org<mailto:dev@griffin.incubator.apache.org>; 
>>Mara Preotescu; Abishek Kunduru; Ananthanarayanan Ms
>>Subject: Griffin Accuracy Doubts
>>
>>Dear William,
>> Good day. Please let us know about the questions below.
>>
>>1. Does Griffin plan to support multiple tables as src, e.g. a src table 
>>built from a query joining tabl1, tabl2, ..., tabln, selecting a few cols, 
>>and then using that as src, and similarly for the destination?
>>
>>--Yes, since we have changed our backend to fully support SQL.
>>
>>2. Would Griffin support a column-wise result metric instead of an overall 
>>matched count?
>>e.g. if the src table has 4 cols and the dest has 4 or more cols, and the 
>>user wants to check the accuracy of col4, which is 'price', in src and 
>>destination, by adding a new col to the src table, '%change', which would 
>>be abs(src_price - dest_price)/100.
>>
>>--Yes, since we have changed our backend to fully support SQL.
>>
>>Fact_tag | fact_sDesc | fat_Ldesc | SRC Fact_precision | SRC Data | TGT Data | % Chg
>>f000000000000000000100000000000000000000 | Units | Sales Units | 1 | 25510.00 | 2551.00 | 0.00
>>f000000000000000000100000000000000000001 | Units (any promo) | Sales Units (any promo) | 1 | 5846.00 | 584.59 | 0.00
>>f000000000000000000100000000000000000002 | Units (no promo) | Sales Units (no promo) | 1 | 17172.00 | 1717.25 | 0.00
>>
>>
>>3. Support UDFs on the cols before comparing; e.g. one of our srcs is a 
>>table in a mainframe format where dots/decimals are ignored, and we have to 
>>apply logic to recover the decimals.
>>
>>--For UDFs, if one is registered in your computing cluster, it should work 
>>from the command line. We are considering refactoring the UI to support a 
>>SQL editor directly, but we are still considering how to provide a better 
>>user experience for UDFs from the front end.
>>
>>4. Last time, a few features were said to be under discussion; could you 
>>please let us know if there is a road map for them, like advanced SQL 
>>features, adding a group-by clause, UDFs, etc.?
>>
>>--SQL support is already done; we will have a quick release after it is 
>>fully tested and reviewed.
>>
>>5. Griffin compares the cols selected by using them in joins. Do we plan to 
>>let the user select a few cols other than those mentioned in the joins, so 
>>we can compare the src col's value vs the dest col's value, say without the 
>>join and only by an order-by clause?
>>
>>
>>select col from src order by prikey
>>select col from tgt order by prikey
>>
>>src.col vs tgt.col instead of the join key. Please let us know.
>>
>>--For this case, it depends on how you would like to map src.col to 
>>tgt.col. We don't need a join key if you can elaborate the mapping in SQL, 
>>since we are fully supporting SQL.
>>
>>Thank you very much.
>>
>>
>>
>>Thanks & Regards
>>M. Vidyaranya
>>NDX
>>TCS - Nielsen Relationship
>>Mobile  : +91 9790591311
>
>
>
>
>





 







 

