You can find all the details in the blog post below: you just pass in a Spark DataFrame, Deequ can generate rule recommendations for you, and you can also write custom, complex rules.
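To make that concrete, here is a minimal sketch in Scala (assuming Spark with the deequ dependency on the classpath; the input path and the order_id/amount columns are placeholders I made up for illustration):

import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

val spark = SparkSession.builder().appName("dq-checks").getOrCreate()

// Placeholder input; point this at your own table or path
val df = spark.read.parquet("s3://my-bucket/orders/")

// Declarative checks on the DataFrame
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "orders checks")
      .hasSize(_ > 0)            // dataset is not empty
      .isComplete("order_id")    // no nulls
      .isUnique("order_id")      // primary-key style uniqueness
      .isNonNegative("amount"))  // sanity check on a metric column
  .run()

checkResultsAsDataFrame(spark, verificationResult).show(false)

// Ask Deequ to profile the data and suggest candidate rules
val suggestionResult = ConstraintSuggestionRunner()
  .onData(df)
  .addConstraintRules(Rules.DEFAULT)
  .run()

suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach(s => println(s"$column: ${s.description} -> ${s.codeForConstraint}"))
}

The suggestion runner profiles the data and prints candidate constraints that you can review and promote into the verification suite; for metric-based columns you would typically lean on checks like isNonNegative, hasMin, hasMax or hasMean with your own assertions.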
https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/

Regards,
Vaquar khan

On Wed, Dec 28, 2022, 9:40 AM rajat kumar <[email protected]> wrote:

> Thanks for the input, folks.
>
> Hi Vaquar,
>
> I saw that we have various types of checks in GE and Deequ. Could you
> please suggest what types of checks you used for metric-based columns?
>
> Regards
> Rajat
>
> On Wed, Dec 28, 2022 at 12:15 PM vaquar khan <[email protected]> wrote:
>
>> I would suggest Deequ; I have implemented it many times, and it is easy
>> and effective.
>>
>> Regards,
>> Vaquar khan
>>
>> On Tue, Dec 27, 2022, 10:30 PM ayan guha <[email protected]> wrote:
>>
>>> The way I would approach it is to evaluate GE, Deequ (there is a Python
>>> binding called pydeequ) and others like Delta Live Tables with
>>> expectations, from a data quality feature perspective. All these tools
>>> have their pros and cons, and all of them are compatible with Spark as
>>> a compute engine.
>>>
>>> Also, you may want to look at dbt-based DQ toolsets if SQL is your
>>> thing.
>>>
>>> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen <[email protected]> wrote:
>>>
>>>> I think this is kind of mixed up. Data warehouses are simple SQL
>>>> creatures; Spark is (also) a distributed compute framework. It's a bit
>>>> like comparing a web server to Java.
>>>> Are you thinking of Spark SQL? Then, sure, you may well find it more
>>>> complicated, but it's also just a data-warehousey SQL surface.
>>>>
>>>> But none of that relates to the question of data quality tools. You
>>>> could use GE with Redshift, or indeed with Spark - are you familiar
>>>> with it? It's probably one of the most common tools people use with
>>>> Spark for this, in fact. It's just a Python lib at heart and you can
>>>> apply it with Spark, but _not_ with a data warehouse, so I'm not sure
>>>> what you're getting at.
>>>>
>>>> Deequ is also commonly seen. It's actually built on Spark, so again, I
>>>> am confused about this "use Redshift or Snowflake, not Spark".
>>>>
>>>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Spark is just another querying engine with a lot of hype.
>>>>>
>>>>> I would highly suggest using Redshift (storage and compute decoupled
>>>>> mode) or Snowflake, without all this super complicated understanding
>>>>> of containers/disk space, mind-numbing variables, rocket-science
>>>>> tuning, hair-splitting failure scenarios, etc. After that, try
>>>>> solutions like Athena, or Trino/Presto, and then come to Spark.
>>>>>
>>>>> Try out solutions like "great expectations" if you are looking for
>>>>> data quality, are not entirely sucked into the world of Spark, and
>>>>> want to keep your options open.
>>>>>
>>>>> Don't get me wrong, Spark used to be great in 2016-2017, but there
>>>>> are superb alternatives now, and the industry, in this recession,
>>>>> should focus on getting more value for every single dollar it spends.
>>>>>
>>>>> Best of luck.
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>>
>>>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <[email protected]> wrote:
>>>>>
>>>>>> Well, you need to qualify your statement on data quality. Are you
>>>>>> talking about data lineage here?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> View my LinkedIn profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property
>>>>>> which may arise from relying on this email's technical content is
>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>
>>>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>> Hoping you are doing well. I want to implement data quality checks
>>>>>>> to detect issues in data in advance. I have heard about a few
>>>>>>> frameworks like GE/Deequ. Can anyone please suggest which one is
>>>>>>> good and how do I get started on it?
>>>>>>>
>>>>>>> Regards
>>>>>>> Rajat
>>>>>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
