Re: Profiling data quality with Spark

2022-12-29 Thread Chitral Verma
Hi Rajat, I have worked for years on democratizing data quality at some of the top organizations, and I'm also an Apache Griffin contributor and PMC member, so I know this space well. :) Coming back to your original question, there are a lot of data quality options available in the market today

Re: Profiling data quality with Spark

2022-12-28 Thread infa elance
You can also look at Informatica Data Quality, which runs on Spark. Of course it's not free, but you can sign up for a 30-day free trial. They have both profiling and prebuilt data quality rules and accelerators.

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
@Gourav Sengupta, why are you sending unnecessary emails? If you think Snowflake is good, please use it. The question here was different, and you are talking about a totally different topic. Please respect the group guidelines. Regards, Vaquar khan

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
Here you can find all the details. You just need to pass a Spark DataFrame; Deequ also generates recommendations for rules, and you can write custom complex rules as well. https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ Regards, Vaquar khan
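As a rough illustration of what "generating recommendations for rules" can mean, the sketch below profiles a column and proposes candidate constraints. It is plain Python with no Spark or Deequ dependency, and it is not Deequ's API: `suggest_rules`, the sample rows, and the rule wording are all hypothetical names made up for this example.

```python
# Hypothetical sample data; in practice this would be a Spark DataFrame.
rows = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": 0.0},
    {"id": 3, "price": None},
]


def suggest_rules(rows, column):
    """Propose candidate rules for a column based on the values observed."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    suggestions = []

    # No nulls observed: propose a completeness rule.
    if values and len(non_null) == len(values):
        suggestions.append(f"{column}: is complete")

    # No repeated values observed: propose a uniqueness rule.
    if non_null and len(set(non_null)) == len(non_null):
        suggestions.append(f"{column}: is unique")

    # All values numeric (bools excluded) and >= 0: propose a non-negativity rule.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if non_null and numeric and min(non_null) >= 0:
        suggestions.append(f"{column}: is non-negative")

    return suggestions


if __name__ == "__main__":
    for col in ("id", "price"):
        print(col, suggest_rules(rows, col))
```

Suggestions derived from a small sample can be misleading (here `price` looks unique only because there are two values), so they are best treated as candidates to review rather than rules to enforce as-is.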

Re: Profiling data quality with Spark

2022-12-28 Thread rajat kumar
Thanks for the input, folks. Hi Vaquar, I saw that there are various types of checks in GE and Deequ. Could you please suggest which types of checks you used for metric-based columns? Regards, Rajat
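For readers wondering what checks on a metric column typically look like, here is a minimal, library-free Python sketch of three common ones: completeness, uniqueness, and a value-range check. The function names, sample rows, and thresholds are assumptions for illustration only; they are not the GE or Deequ APIs, and each tool defines its own exact metric semantics.

```python
from collections import Counter

# Hypothetical sample rows; "amount" is the metric column being checked.
rows = [
    {"order_id": 1, "amount": 10.5},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": -4.0},
    {"order_id": 3, "amount": 22.0},
]


def completeness(rows, column):
    """Fraction of rows where the column is not None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)


def uniqueness(rows, column):
    """Fraction of non-null values that occur exactly once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def fraction_in_range(rows, column, low, high):
    """Fraction of non-null values that fall within [low, high]."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if low <= v <= high) / len(values)


def run_checks(rows):
    """Compare each metric to a hypothetical threshold; True means the check passed."""
    return {
        "amount completeness >= 0.9": completeness(rows, "amount") >= 0.9,
        "order_id uniqueness == 1.0": uniqueness(rows, "order_id") == 1.0,
        "amount within [0, 1000]": fraction_in_range(rows, "amount", 0, 1000) == 1.0,
    }


if __name__ == "__main__":
    for name, passed in run_checks(rows).items():
        print("PASS" if passed else "FAIL", name)
```

Keeping the metric computation separate from the threshold comparison makes it possible to log the raw metric values over time and tune thresholds later.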

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
Hi Sean, the entire narrative of Spark being a unified analytics tool falls flat, as what should have been an engine on Spark is now deliberately floated off as a separate company called Ray, and all the unified narrative rings hollow. Spark is nothing more than a SQL engine as per Spark's own

Re: Profiling data quality with Spark

2022-12-27 Thread vaquar khan
I would suggest Deequ; I have implemented it many times, and it is easy and effective. Regards, Vaquar khan

Re: Profiling data quality with Spark

2022-12-27 Thread ayan guha
The way I would approach this is to evaluate GE, Deequ (there is a Python binding called pydeequ), and others like Delta Live Tables with expectations from a data quality feature perspective. All these tools have their pros and cons, and all of them are compatible with Spark as a compute engine. Also,

Re: Profiling data quality with Spark

2022-12-27 Thread Walaa Eldin Moustafa
Rajat, you might want to read about Data Sentinel, a data validation tool on Spark developed at LinkedIn. https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation The project is not open source, but the blog post might give you insights about how such a system

Re: Profiling data quality with Spark

2022-12-27 Thread Sean Owen
I think this is kind of mixed up. Data warehouses are simple SQL creatures; Spark is (also) a distributed compute framework. Kind of like comparing maybe a web server to Java. Are you thinking of Spark SQL? Then, I dunno, sure, you may well find it more complicated, but it's also just a data

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
Hi, Spark is just another querying engine with a lot of hype. I would highly suggest using Redshift (storage and compute decoupled mode) or Snowflake without all this super-complicated understanding of containers/disk space, mind-numbing variables, rocket-science tuning, hair-splitting failure

Re: Profiling data quality with Spark

2022-12-27 Thread Mich Talebzadeh
Well, you need to qualify your statement on data quality. Are you talking about data lineage here? HTH

Profiling data quality with Spark

2022-12-27 Thread rajat kumar
Hi folks, hope you are doing well. I want to implement data quality checks to detect issues in data in advance. I have heard about a few frameworks such as Great Expectations (GE) and Deequ. Can anyone please suggest which one is good and how I can get started with it? Regards, Rajat