You can find all the details in the blog post below: you just pass in a Spark DataFrame, Deequ can generate rule recommendations for you, and you can also write custom, complex rules.
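To make that concrete, here is a minimal sketch in Scala (assuming Spark with the deequ dependency on the classpath; the input path and the order_id/amount columns are placeholders I made up for illustration):

import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

val spark = SparkSession.builder().appName("dq-checks").getOrCreate()

// Placeholder input; point this at your own table or path
val df = spark.read.parquet("s3://my-bucket/orders/")

// Declarative checks on the DataFrame
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "orders checks")
      .hasSize(_ > 0)            // dataset is not empty
      .isComplete("order_id")    // no nulls
      .isUnique("order_id")      // primary-key style uniqueness
      .isNonNegative("amount"))  // sanity check on a metric column
  .run()

checkResultsAsDataFrame(spark, verificationResult).show(false)

// Ask Deequ to profile the data and suggest candidate rules
val suggestionResult = ConstraintSuggestionRunner()
  .onData(df)
  .addConstraintRules(Rules.DEFAULT)
  .run()

suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach(s => println(s"$column: ${s.description} -> ${s.codeForConstraint}"))
}

The suggestion runner profiles the data and prints candidate constraints that you can review and promote into the verification suite; for metric-based columns you would typically lean on checks like isNonNegative, hasMin, hasMax or hasMean with your own assertions.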
https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/

Regards,
Vaquar khan

On Wed, Dec 28, 2022, 9:40 AM rajat kumar <[email protected]> wrote:

> Thanks for the input, folks.
>
> Hi Vaquar,
>
> I saw that we have various types of checks in GE and Deequ. Could you
> please suggest what types of checks you used for metric-based columns?
>
> Regards
> Rajat
>
> On Wed, Dec 28, 2022 at 12:15 PM vaquar khan <[email protected]> wrote:
>
>> I would suggest Deequ; I have implemented it many times, and it is easy
>> and effective.
>>
>> Regards,
>> Vaquar khan
>>
>> On Tue, Dec 27, 2022, 10:30 PM ayan guha <[email protected]> wrote:
>>
>>> The way I would approach it is to evaluate GE, Deequ (there is a Python
>>> binding called pydeequ) and others like Delta Live Tables with
>>> expectations, from a data quality feature perspective. All these tools
>>> have their pros and cons, and all of them are compatible with Spark as
>>> a compute engine.
>>>
>>> Also, you may want to look at dbt-based DQ toolsets if SQL is your
>>> thing.
>>>
>>> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen <[email protected]> wrote:
>>>
>>>> I think this is kind of mixed up. Data warehouses are simple SQL
>>>> creatures; Spark is (also) a distributed compute framework. It's a bit
>>>> like comparing a web server to Java.
>>>> Are you thinking of Spark SQL? Then, sure, you may well find it more
>>>> complicated, but it's also just a data-warehousey SQL surface.
>>>>
>>>> But none of that relates to the question of data quality tools. You
>>>> could use GE with Redshift, or indeed with Spark - are you familiar
>>>> with it? It's probably one of the most common tools people use with
>>>> Spark for this, in fact. It's just a Python lib at heart and you can
>>>> apply it with Spark, but _not_ with a data warehouse, so I'm not sure
>>>> what you're getting at.
>>>>
>>>> Deequ is also commonly seen. It's actually built on Spark, so again, I
>>>> am confused about this "use Redshift or Snowflake, not Spark".
>>>>
>>>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Spark is just another querying engine with a lot of hype.
>>>>>
>>>>> I would highly suggest using Redshift (storage and compute decoupled
>>>>> mode) or Snowflake, without all this super complicated understanding
>>>>> of containers/disk space, mind-numbing variables, rocket-science
>>>>> tuning, hair-splitting failure scenarios, etc. After that, try
>>>>> solutions like Athena, or Trino/Presto, and then come to Spark.
>>>>>
>>>>> Try out solutions like "great expectations" if you are looking for
>>>>> data quality, are not entirely sucked into the world of Spark, and
>>>>> want to keep your options open.
>>>>>
>>>>> Don't get me wrong, Spark used to be great in 2016-2017, but there
>>>>> are superb alternatives now, and the industry, in this recession,
>>>>> should focus on getting more value for every single dollar it spends.
>>>>>
>>>>> Best of luck.
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>>
>>>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <[email protected]> wrote:
>>>>>
>>>>>> Well, you need to qualify your statement on data quality. Are you
>>>>>> talking about data lineage here?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> View my LinkedIn profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property
>>>>>> which may arise from relying on this email's technical content is
>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>
>>>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>> Hoping you are doing well. I want to implement data quality checks
>>>>>>> to detect issues in data in advance. I have heard about a few
>>>>>>> frameworks like GE/Deequ. Can anyone please suggest which one is
>>>>>>> good and how do I get started on it?
>>>>>>>
>>>>>>> Regards
>>>>>>> Rajat
>>>>>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
