Hi Lionel,
Thank you — griffin-dsl works for this use case. We are awaiting the next release to cover another of our use cases: using more than one table, joined via a SQL query, as the source, versus a multi-sliced set of tables joined via another SQL query as the target, and running an accuracy check between the two.
Regards,
Ananthanarayanan.M.S

On Thu, Nov 30, 2017 at 7:42 PM, Lionel Liu <bhlx3l...@163.com> wrote:
> Hi Ananthanarayanan.M.S,
> Thanks for your reply. I think you're right: in this version, griffin-dsl
> can handle accuracy and profiling cases, but most users prefer SQL.
> Actually, in this version, to accommodate streaming mode, we added some
> specific processing to the Spark SQL path; it can handle some profiling
> cases but fails in others. We'll fix this bug and try to pass SQL directly
> to the measure engine, so you'll be able to use SQL in the next version.
>
> --
> Regards,
> Lionel, Liu
>
> At 2017-11-29 12:07:39, "Ananthanarayanan Ms"
> <ananthanarayanan.ms.ap@nielsen.com> wrote:
>
> Hi Lionel,
> Sure, I will try griffin-dsl and update here. I had wanted to keep a
> distance from the griffin-dsl feature, as it has a learning curve around
> its expressions and attributes. We are planning for end users who would
> not have this background, so we would need an intermediate step to
> translate their needs into this DSL, unlike SQL, in which they are
> already proficient.
>
> Thank you.
>
> Regards,
> Ananthanarayanan.M.S
>
> On Tue, Nov 28, 2017 at 8:38 AM, Lionel Liu <bhlx3l...@163.com> wrote:
>
>> Hi Ananthanarayanan.M.S,
>> Sorry for the late reply. I think you are trying the accuracy measure by
>> configuring the field "dsl.type" as "spark-sql", not as "griffin-dsl",
>> right?
>> If you just want accuracy, you'd better try "griffin-dsl" and configure
>> "dq.type" as "accuracy", with a "rule" like this: "source.name =
>> target.name and source.age = target.age".
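For readers following the thread: what an "accuracy" rule such as the one above measures can be sketched in plain Python. The rows below are purely illustrative; Griffin itself evaluates this on Spark.

```python
# Sketch of what an accuracy rule like
# "source.name = target.name and source.age = target.age" measures:
# the fraction of source rows that find a target row agreeing on all
# compared fields. Toy in-memory data, not Griffin's actual engine.

source = [
    {"name": "alice", "age": 30},
    {"name": "bob",   "age": 25},
    {"name": "carol", "age": 41},
]
target = [
    {"name": "alice", "age": 30},
    {"name": "bob",   "age": 26},  # age differs, so this row won't match
]

def accuracy(source_rows, target_rows, keys):
    """Return (matched, total, ratio) for the given compared fields."""
    target_keys = {tuple(row[k] for k in keys) for row in target_rows}
    matched = sum(
        1 for row in source_rows
        if tuple(row[k] for k in keys) in target_keys
    )
    total = len(source_rows)
    return matched, total, matched / total

matched, total, ratio = accuracy(source, target, ["name", "age"])
print(matched, total, ratio)  # only "alice" matches: 1 of 3
```

The DQ result is the simple ratio (matched/total) that Lionel describes later in the thread as the preferred shape for a metric.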
>> For the "spark-sql" type, we add a column "__tmst" into the SQL and group
>> by it, to fit some streaming cases, but this does need some background
>> knowledge. For example, in an accuracy case with multiple data sources,
>> the SQL engine cannot tell whether the column "__tmst" belongs to the
>> source or the target, so the reference is ambiguous. We'll fix this bug
>> in a later version.
>> After all, for accuracy cases, we recommend a rule configuration like
>> this:
>> {
>>   "dsl.type": "griffin-dsl",
>>   "dq.type": "accuracy",
>>   "rule": "source.name = target.name and source.age = target.age"
>> }
>>
>> Hope this helps, thanks a lot.
>>
>> --
>> Regards,
>> Lionel, Liu
>>
>> At 2017-11-21 20:42:45, "Ananthanarayanan Ms" <
>> ananthanarayanan.ms...@nielsen.com> wrote:
>>
>> Hi Lionel,
>> Even after the change, the issue remains. It looks like, during the join,
>> both src and tgt have already been populated via withColumn with
>> "__tmst", hence '__tmst' is ambiguous here. Could you please help to
>> check and clarify?
>>
>> 17/11/21 08:38:46 ERROR engine.SparkSqlEngine: run spark sql [ SELECT
>> `__tmst` AS `__tmst`, source.product_id, source.market_id,
>> source.period_id, source.s5_volume_rom_distrib FROM source LEFT JOIN
>> target ON coalesce(source.period_id, 'null') =
>> coalesce(target.period_id, 'null') and coalesce(source.market_id, 'null')
>> = coalesce(target.market_id, 'null') and coalesce(source.product_id,
>> 'null') = coalesce(target.product_id, 'null') WHERE (NOT
>> (source.product_id IS NULL AND source.market_id IS NULL AND
>> source.period_id IS NULL)) AND target.product_id IS NULL AND
>> target.market_id IS NULL AND target.period_id IS NULL GROUP BY `__tmst` ]
>> error: Reference '__tmst' is ambiguous, could be: __tmst#145L,
>> __tmst#146L.; line 1 pos 566
>>
>> Regards,
>> Ananthanarayanan.M.S
>>
>> On Tue, Nov 21, 2017 at 8:05 AM, Lionel Liu <bhlx3l...@163.com> wrote:
>>
>>> Hi Ananthanarayanan.M.S,
>>>
>>> Actually, we add a new column `__tmst` to each table, to persist the
>>> timestamp of each data row.
>>> But the SQL you list doesn't look like our exact logic, so I'd like to
>>> see your accuracy rule statement to check.
>>>
>>> If your accuracy rule statement is like:
>>> "source.product_id = target.product_id, source.market_id =
>>> target.market_id, source.period_id = target.period_id",
>>> the SQL for that step should be:
>>> "SELECT source.* FROM source LEFT JOIN target ON
>>> coalesce(source.product_id, 'null') = coalesce(target.product_id,
>>> 'null') and coalesce(source.market_id, 'null') =
>>> coalesce(target.market_id, 'null') and coalesce(source.period_id,
>>> 'null') = coalesce(target.period_id, 'null') WHERE (NOT
>>> (source.product_id IS NULL AND source.market_id IS NULL AND
>>> source.period_id IS NULL)) AND target.product_id IS NULL AND
>>> target.market_id IS NULL AND target.period_id IS NULL".
>>>
>>> Hope this helps, thanks.
>>>
>>> --
>>> Regards,
>>> Lionel, Liu
>>>
>>> At 2017-11-21 06:50:09, "William Guo" <gu...@apache.org> wrote:
>>> >hello Ananthanarayanan.M.S,
>>> >
>>> >We support some partitioned tables. Could you show us your partitioned
>>> >table description to check? Some sample data would be helpful as well.
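The null-safe miss-record join Lionel describes can be reproduced on a toy dataset with sqlite3 (table and column names follow the thread; the data is invented for illustration — Griffin runs the equivalent on Spark SQL):

```python
# Sketch of the miss-record join from the thread, run on sqlite3 toy tables.
# coalesce(key, 'null') makes the equi-join null-safe; the WHERE clause keeps
# source rows that found no target match (all target join keys NULL).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (product_id TEXT, market_id TEXT, period_id TEXT);
    CREATE TABLE target (product_id TEXT, market_id TEXT, period_id TEXT);
    INSERT INTO source VALUES
        ('p1', 'm1', 't1'), ('p2', 'm1', 't1'), ('p3', NULL, 't2');
    INSERT INTO target VALUES
        ('p1', 'm1', 't1'), ('p3', NULL, 't2');
""")

misses = conn.execute("""
    SELECT source.* FROM source LEFT JOIN target
      ON  coalesce(source.product_id, 'null') = coalesce(target.product_id, 'null')
      AND coalesce(source.market_id,  'null') = coalesce(target.market_id,  'null')
      AND coalesce(source.period_id,  'null') = coalesce(target.period_id,  'null')
    WHERE NOT (source.product_id IS NULL AND source.market_id IS NULL
               AND source.period_id IS NULL)
      AND target.product_id IS NULL AND target.market_id IS NULL
      AND target.period_id IS NULL
""").fetchall()

print(misses)  # [('p2', 'm1', 't1')] -- the one source row absent from target
```

Note how the 'p3' row matches even though its market_id is NULL on both sides: coalesce maps both NULLs to the sentinel string 'null', which is exactly why Griffin's generated SQL uses it.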
>>> >
>>> >Thanks,
>>> >William
>>> >________________________________
>>> >From: Ananthanarayanan Ms <ananthanarayanan.ms...@nielsen.com>
>>> >Sent: Tuesday, November 21, 2017 3:53 AM
>>> >To: William Guo
>>> >Cc: dev@griffin.incubator.apache.org; Abishek Kunduru
>>> >Subject: Re: Re: Griffin Accuracy Doubts
>>> >
>>> >Hello William / Lionel,
>>> >Could you please let us know if we can use 1.6? Could you please also
>>> >let us know whether partitioned tables alone can be used as src/tgt, as
>>> >the error below occurs when we build and run the master code.
>>> >
>>> >17/11/20 14:51:13 ERROR engine.SparkSqlEngine: run spark sql [ SELECT
>>> >`__tmst` AS `__tmst`, source.product_id, source.market_id,
>>> >source.period_id, s5_volume_rom_distrib FROM source LEFT JOIN target ON
>>> >coalesce(source.period_id, 'null') = coalesce(target.period_id, 'null')
>>> >and coalesce(source.market_id, 'null') = coalesce(target.market_id,
>>> >'null') and coalesce(source.product_id, 'null') =
>>> >coalesce(target.product_id, 'null') WHERE (NOT (source.product_id IS
>>> >NULL)) GROUP BY `__tmst` ] error: Reference '__tmst' is ambiguous,
>>> >could be: __tmst#145L, __tmst#146L.; line 1 pos 413
>>> >
>>> >Regards,
>>> >Ananthanarayanan.M.S
>>> >
>>> >On Wed, Oct 4, 2017 at 2:39 PM, William Guo
>>> ><gu...@apache.org<mailto:gu...@apache.org>> wrote:
>>> >
>>> >hi Abishek,
>>> >
>>> >For SQL support, the version should be 0.1.6; we will release it in the
>>> >middle of October.
>>> >
>>> >Thanks,
>>> >
>>> >William
>>> >
>>> >________________________________
>>> >From: Abishek Kunduru
>>> ><abishek.kund...@nielsen.com<mailto:abishek.kund...@nielsen.com>>
>>> >Sent: Tuesday, October 3, 2017 12:35 AM
>>> >To: 刘力力
>>> >Cc:
>>> >dev@griffin.incubator.apache.org<mailto:dev@griffin.incubator.apache.org>;
>>> >Vidyaranya Malladi; gu...@apache.org<mailto:gu...@apache.org>; Mara
>>> >Preotescu; Ananthanarayanan Ms
>>> >Subject: Re: Re: Griffin Accuracy Doubts
>>> >
>>> >Thanks, William and Lionel, for the quick response. Could you please
>>> >tell us when Griffin with SQL support will be released, and its version
>>> >number?
>>> >
>>> >Regarding the second question, Vidya is not referring to changing the
>>> >source or target data, but rather to a DQ feature that compares source
>>> >and target columns and outputs the difference as a percentage change
>>> >(Vidya also gave an example in the table below, which shows the
>>> >difference between source and target numeric columns).
>>> >
>>> >Best,
>>> >Abishek
>>> >
>>> >On Fri, Sep 29, 2017 at 10:13 PM, 刘力力
>>> ><bhlx3l...@163.com<mailto:bhlx3l...@163.com>> wrote:
>>> >Hi Vidyaranya,
>>> >
>>> >I have some more comments on the answers from William. Please check the
>>> >comments below.
>>> >
>>> >1. Does Griffin plan to support multiple tables as src, e.g. a src
>>> >table built from a query joining tabl1, tabl2, ..., tabln with a few
>>> >selected columns, finally used as src, and similarly for the
>>> >destination?
>>> >
>>> >--Yes, since we have changed our backend to fully support SQL.
>>> >
>>> >2. Would Griffin support the result metric column-wise instead of an
>>> >overall matched count?
>>> >E.g. if the src table has 4 columns and the dest has 4 or more, and the
>>> >user wants to check the accuracy of col4, which is 'price', in src and
>>> >destination, by adding a new column '%change' to the src table,
>>> >computed as abs(src_price - dest_price)/100.
>>> >
>>> >--Yes, since we have changed our backend to fully support SQL.
>>> >
>>> >Fact_tag                                  fact_sDesc        fat_Ldesc                SRC Fact_precision  SRC Data  TGT Data  % Chg
>>> >f000000000000000000100000000000000000000  Units             Sales Units              1                   25510.00  2551.00   0.00
>>> >f000000000000000000100000000000000000001  Units (any promo) Sales Units (any promo)  1                   5846.00   584.59    0.00
>>> >f000000000000000000100000000000000000002  Units (no promo)  Sales Units (no promo)   1                   17172.00  1717.25   0.00
>>> >
>>> >--[comment from Lionel] Actually, I think this modification of the data
>>> >source is more like an ETL task than the DQ domain. We consider that
>>> >the DQ result from Griffin should be simple, like a value or a
>>> >percentage or a group of them; that might be better for a DQ check.
>>> >
>>> >3. Support UDFs on the columns before comparing; e.g. one of our
>>> >sources is a table in mainframe format where dots/decimals are dropped,
>>> >and we have to apply logic to recover the decimals.
>>> >
>>> >--For UDFs, if the UDF is registered in your computing cluster, it
>>> >should work from the command line. We are considering refactoring the
>>> >UI to support a SQL editor directly, but we are still considering how
>>> >to provide a better front-end user experience for UDFs.
>>> >
>>> >4. Last time we discussed a few features that were said to be under
>>> >discussion; could you please let us know if there is a roadmap for
>>> >them, like advanced SQL features, adding a group-by clause, UDFs, etc.?
>>> >
>>> >--SQL support is already done; we will have a quick release after it is
>>> >fully tested and reviewed.
>>> >
>>> >5.
Griffin compares the columns selected by using them in the joins. Do we
>>> >plan to enhance this so that the user can select a few columns other
>>> >than those mentioned in the joins, and we compare the src column's
>>> >values against the dest column's values, say without the join and only
>>> >by an order-by clause?
>>> >
>>> >select col from src order by prikey
>>> >select col from tgt order by prikey
>>> >
>>> >i.e. src.col vs tgt.col instead of the join key. Please let us know.
>>> >
>>> >--For this case, it depends on how you would like to map src.col to
>>> >tgt.col. We don't need a join key if you can elaborate the mapping in
>>> >SQL, since we are fully supporting SQL.
>>> >
>>> >--[comment from Lionel] We support SQL directly; if you can describe
>>> >your case in SQL, we can do that.
>>> >
>>> >Thanks very much.
>>> >
>>> >--
>>> >Regards,
>>> >Lionel, Liu
>>> >
>>> >At 2017-09-30 08:23:39, "William Guo"
>>> ><guo...@outlook.com<mailto:guo...@outlook.com>> wrote:
>>> >>hi Vidyaranya,
>>> >>
>>> >>Please check my comments inline.
>>> >>
>>> >>Thanks,
>>> >>
>>> >>William
>>> >>
>>> >>________________________________
>>> >>From: Vidyaranya Malladi
>>> >><vidyaranya.malladi...@nielsen.com<mailto:vidyaranya.malladi...@nielsen.com>>
>>> >>Sent: Friday, September 29, 2017 10:06 PM
>>> >>To: gu...@apache.org<mailto:gu...@apache.org>;
>>> >>guo...@outlook.com<mailto:guo...@outlook.com>
>>> >>Cc:
>>> >>dev@griffin.incubator.apache.org<mailto:dev@griffin.incubator.apache.org>;
>>> >>Mara Preotescu; Abishek Kunduru; Ananthanarayanan Ms
>>> >>Subject: Griffin Accuracy Doubts
>>> >>
>>> >>Dear William,
>>> >> Good day. Please let us know about the questions below.
>>> >>
>>> >>[...]
>>> >>
>>> >>Thank you very much.
>>> >>
>>> >>Thanks & Regards
>>> >>M. Vidyaranya
>>> >>NDX
>>> >>TCS - Nielsen Relationship
>>> >>Mobile : +91 9790591311
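
For reference, the positional comparison asked about in question 5 can be sketched in Python. The key names and data below are invented for illustration, and the "% change" formula follows question 2's abs(src - tgt)/100; in SQL one would typically pair rows via a row number derived from the ORDER BY.

```python
# Sketch of question 5: order both sides by the same primary key, then
# compare src.col with tgt.col row by row, with no join on the compared
# column itself. Toy data only.

src = [("k2", 10.0), ("k1", 25510.0), ("k3", 17172.0)]  # (prikey, col)
tgt = [("k1", 2551.0), ("k3", 1717.25), ("k2", 10.0)]

src_sorted = [v for _, v in sorted(src)]  # SELECT col FROM src ORDER BY prikey
tgt_sorted = [v for _, v in sorted(tgt)]  # SELECT col FROM tgt ORDER BY prikey

# Per-row percentage-change column between the positionally paired values,
# per question 2's formula abs(src_price - dest_price) / 100.
pct_change = [abs(s - t) / 100 for s, t in zip(src_sorted, tgt_sorted)]
print(pct_change)
```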