Re: Profiling data quality with Spark

Gourav Sengupta Tue, 27 Dec 2022 19:55:41 -0800

Hi,

Spark is just another query engine with a lot of hype.

I would strongly suggest starting with Redshift (in its decoupled
storage-and-compute mode) or Snowflake, which spare you the complicated
understanding of containers and disk space, mind-numbing variables,
rocket-science tuning, hair-splitting failure scenarios, etc. After that,
try solutions like Athena or Trino/Presto, and only then come to Spark.

Try out solutions like Great Expectations if you are looking for data
quality, are not entirely sucked into the world of Spark, and want to keep
your options open.
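
To make "data quality checks" concrete, here is a minimal, engine-agnostic
sketch in plain Python. It is only an illustration of the declarative
"expectation" style, not the Great Expectations or Deequ API: every
function name, column name, and sample row below is a hypothetical
placeholder.

```python
# Hypothetical sketch: declarative data-quality checks in plain Python.
# None of these names come from Great Expectations or Deequ.
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]


def expect_not_null(column: str) -> Check:
    """Pass only if no row has a missing (None) value in `column`."""
    return lambda rows: all(r.get(column) is not None for r in rows)


def expect_unique(column: str) -> Check:
    """Pass only if no value in `column` appears more than once."""
    def check(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return check


def expect_between(column: str, low: float, high: float) -> Check:
    """Pass only if every non-null value in `column` lies in [low, high]."""
    return lambda rows: all(
        low <= r[column] <= high
        for r in rows
        if r.get(column) is not None
    )


def run_checks(rows: List[Row], checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every named check and report pass/fail per check."""
    return {name: check(rows) for name, check in checks.items()}


# Made-up sample data with a duplicate id and a missing amount.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": 250.0},
]

checks = {
    "id_not_null": expect_not_null("id"),
    "id_unique": expect_unique("id"),
    "amount_not_null": expect_not_null("amount"),
    "amount_in_range": expect_between("amount", 0, 1000),
}

report = run_checks(rows, checks)
for name, passed in report.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Because the checks are just named rules over rows, the same idea carries
over to whichever engine you end up on; a dedicated framework mainly adds
reporting, profiling, and connectors on top of it.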

Don't get me wrong, Spark was great in 2016-2017, but there are
superb alternatives now, and the industry, in this recession, should focus
on getting more value for every single dollar it spends.

Best of luck.

Regards,
Gourav Sengupta

On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Well, you need to qualify your statement on data quality. Are you talking
> about data lineage here?
>
> HTH
>
> On Tue, 27 Dec 2022 at 19:25, rajat kumar <kumar.rajat20...@gmail.com>
> wrote:
>
>> Hi Folks,
>> Hope you are doing well. I want to implement data quality checks to detect
>> issues in data in advance. I have heard about a few frameworks like
>> GE (Great Expectations) and Deequ. Can anyone please suggest which one is
>> good and how I can get started with it?
>>
>> Regards
>> Rajat
>>
>
