Re: Data Contracts
No worries. Have you had a chance to look at it? Since this thread has gone dead, I assume there is no appetite for adding data contract functionality?

Regards,

Phillip
Re: Data Contracts
Sorry for using "simple" in my last email. It's not going to be simple in any terms.

Thanks for sharing the Git repo, Phillip. Will definitely go through it.

Thanks,

Deepak
Re: Data Contracts
I think it might be a bit more complicated than this (but happy to be proved wrong).

I have a minimal working example at:

https://github.com/PhillHenry/SparkConstraints.git

that runs out of the box (mvn test) and demonstrates what I am trying to achieve.

A test persists a DataFrame that conforms to the contract, and demonstrates that one that does not conform throws an exception.

I've had to slightly modify three Spark files to add the data contract functionality. If you can think of a more elegant solution, I'd be very grateful.

Regards,

Phillip
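The behaviour the test above describes can be sketched in plain Python (this is not the actual SparkConstraints code; `save_with_contract`, `ContractViolation`, and the constraint names are invented for illustration): a save path that checks every row against the contract and refuses to persist tainted data.

```python
# Illustrative sketch: persist rows only if they satisfy the contract;
# otherwise raise instead of writing. All names here are hypothetical.

class ContractViolation(Exception):
    pass

def save_with_contract(rows, constraints, persist):
    """Check every row against every named constraint before persisting."""
    for i, row in enumerate(rows):
        for name, check in constraints.items():
            if not check(row):
                raise ContractViolation(f"row {i} violates '{name}': {row}")
    persist(rows)

store = []
constraints = {"age_is_plausible": lambda r: 0 <= r["age"] <= 130}

save_with_contract([{"age": 42}], constraints, store.extend)   # persists
try:
    save_with_contract([{"age": 999}], constraints, store.extend)
except ContractViolation as e:
    print(e)   # nothing was persisted
```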
Re: Data Contracts
It can be as simple as adding a function to the Spark session builder, specifically on the read, which can take the YAML file (the definition of the data contracts would be in YAML) and apply it to the DataFrame. It can ignore the rows not matching the data contracts defined in the YAML.

Thanks,

Deepak
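The suggestion above (a reader-side filter that silently drops non-conforming rows) can be sketched in plain Python. The contract would be YAML in the proposal; it is inlined as a dict here to avoid a PyYAML dependency, and the contract vocabulary (`fields`, `type`, `required`) is invented for illustration, not a Spark API.

```python
# Hypothetical contract: which fields must exist and what type they must be.
contract = {
    "fields": {
        "customer_id": {"type": int, "required": True},
        "email":       {"type": str, "required": False},
    }
}

def conforms(row, contract):
    """True iff the row satisfies every field spec in the contract."""
    for name, spec in contract["fields"].items():
        if name not in row:
            if spec["required"]:
                return False
            continue
        if not isinstance(row[name], spec["type"]):
            return False
    return True

def read_with_contract(rows, contract):
    """Keep only the rows that satisfy the contract (Deepak's 'ignore' semantics)."""
    return [r for r in rows if conforms(r, contract)]

rows = [{"customer_id": 1}, {"customer_id": "oops"}, {"email": "a@b.c"}]
print(read_with_contract(rows, contract))  # only the first row survives
```

Whether to drop rows silently or fail loudly is itself a contract decision; Phillip's example later in the thread takes the fail-loudly route.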
Re: Data Contracts
For my part, I'm not too concerned about the mechanism used to implement the validation as long as it's rich enough to express the constraints.

I took a look at JSON Schema (for which there are a number of JVM implementations) but I don't think it can handle more complex data types like dates. Maybe Elliot can comment on this?

Ideally, *any* reasonable mechanism could be plugged in.

But what struck me from trying to write a proof of concept was that it was quite hard to inject my code into this particular area of the Spark machinery. It could very well be due to my limited understanding of the codebase, but it seemed the Spark code would need a bit of a refactor before a component could be injected. Maybe people in this forum with greater knowledge in this area could comment?

BTW, it's interesting to see that Databricks' "Delta Live Tables" appears to be an attempt to implement data contracts within their ecosystem. Unfortunately, I think it's closed source and Python only.

Regards,

Phillip
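On the dates point: JSON Schema does define `"format": "date"`, but in many validators `format` is an annotation rather than an enforced assertion by default, and cross-field temporal rules (such as "admitted on or before discharged") are outside its scope entirely. A plain-Python sketch of the kind of check a contract mechanism would need (the field names are illustrative):

```python
from datetime import date

def valid_stay(row):
    """Both dates must parse as ISO dates and be correctly ordered."""
    try:
        admitted = date.fromisoformat(row["admitted"])
        discharged = date.fromisoformat(row["discharged"])
    except (KeyError, ValueError):
        return False          # missing or malformed date
    return admitted <= discharged

print(valid_stay({"admitted": "2023-06-01", "discharged": "2023-06-05"}))  # True
print(valid_stay({"admitted": "2023-06-05", "discharged": "2023-06-01"}))  # False
print(valid_stay({"admitted": "not-a-date", "discharged": "2023-06-01"}))  # False
```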
Re: Data Contracts
It would be interesting if we think about creating a contract validation library written in JSON format. This would ensure a validation mechanism that relies on this library and could be shared among the relevant parties. Would that be a starting point?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
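A minimal sketch of the shared-JSON-contract idea above: the contract itself is a JSON document that producer and consumer can both hold, and a small library compiles it into checks. The contract vocabulary here (`columns`, `min`, `max`, `not_null`) is invented purely for illustration.

```python
import json

# Hypothetical shared contract, as it might live in a repo or registry.
contract_json = """
{
  "table": "patients",
  "columns": {
    "age":  {"type": "int", "min": 0, "max": 130},
    "name": {"type": "string", "not_null": true}
  }
}
"""

def check_row(row, contract):
    """Return a list of human-readable violations (empty list = conforming)."""
    errors = []
    for col, rules in contract["columns"].items():
        value = row.get(col)
        if value is None:
            if rules.get("not_null"):
                errors.append(f"{col}: must not be null")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{col}: {value} < min {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{col}: {value} > max {rules['max']}")
    return errors

contract = json.loads(contract_json)
print(check_row({"age": 999, "name": None}, contract))
# ['age: 999 > max 130', 'name: must not be null']
```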
Re: Data Contracts
Hi,

While I was at PayPal, we open-sourced a Data Contract template; it is here: https://github.com/paypal/data-contract-template. Companies like GX (Great Expectations) are interested in using it.

Spark could read some elements from it pretty easily, like schema validation and some rule validations. Spark could also generate an embryo of a data contract…

—jgp
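One way Spark might consume the schema section of such a contract is to translate it into a DDL string for `spark.read.schema(...)`. The contract fragment below is invented for illustration (it is not necessarily the exact layout of the PayPal template), and the type mapping is deliberately tiny:

```python
# Hypothetical columns section of a data contract.
columns = [
    {"column": "customer_id", "logicalType": "string", "isNullable": False},
    {"column": "open_date",   "logicalType": "date",   "isNullable": True},
    {"column": "balance",     "logicalType": "double", "isNullable": True},
]

# Minimal illustrative mapping from contract types to Spark SQL type names.
SPARK_TYPES = {"string": "STRING", "date": "DATE", "double": "DOUBLE"}

def to_ddl(columns):
    """Render the contract's columns as a Spark-style DDL schema string."""
    parts = []
    for c in columns:
        null = "" if c["isNullable"] else " NOT NULL"
        parts.append(f"{c['column']} {SPARK_TYPES[c['logicalType']]}{null}")
    return ", ".join(parts)

print(to_ddl(columns))
# customer_id STRING NOT NULL, open_date DATE, balance DOUBLE
```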
Re: Data Contracts
>From my limited understanding of data contracts, there are two factors that deem necessary. 1. procedure matter 2. technical matter I mean this is nothing new. Some tools like Cloud data fusion can assist when the procedures are validated. Simply "The process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.". In the old time, we had staging tables that were used to clean and prune data from multiple sources. Nowadays we use the so-called Integration layer. If you use Spark as an ETL tool, then you have to build this validation yourself. Case in point, how to map customer_id from one source to customer_no from another. Legacy systems are full of these anomalies. MDM can help but requires human intervention which is time consuming. I am not sure the role of Spark here except being able to read the mapping tables. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Tue, 13 Jun 2023 at 10:01, Phillip Henry wrote: > Hi, Fokko and Deepak. > > The problem with DBT and Great Expectations (and Soda too, I believe) is > that by the time they find the problem, the error is already in production > - and fixing production can be a nightmare. > > What's more, we've found that nobody ever looks at the data quality > reports we already generate. 
>
> You can, of course, run DBT, GE etc as part of a CI/CD pipeline, but it's usually against synthetic or, at best, sampled data (laws like GDPR generally stop personal information data being anywhere but prod).
>
> What I'm proposing is something that stops production data ever being tainted.
>
> Hi, Elliot.
>
> Nice to see you again (we worked together 20 years ago)!
>
> The problem here is that a schema itself won't protect me (at least as I understand your argument). For instance, I have medical records that say some of my patients are 999 years old, which is clearly ridiculous, but their age correctly conforms to an integer data type. I have other patients who were discharged *before* they were admitted to hospital. I have 28 patients out of literally millions who recently attended hospital but were discharged on 1/1/1900. As you can imagine, this made the average length of stay (a key metric for acute hospitals) much lower than it should have been. It only came to light when some average lengths of stay were negative!
>
> In all these cases, the data faithfully adhered to the schema.
>
> Hi, Ryan.
>
> This is an interesting point. There *should* indeed be a human connection, but often there isn't. For instance, I have a friend who complained that his company's Zurich office made a breaking change and was not even aware that his London-based department existed, never mind depended on their data. In large organisations, this is pretty common.
>
> TBH, my proposal doesn't address this particular use case (maybe hooks and metastore listeners would...?) But my point remains that although these relationships should exist, in a sufficiently large organisation, they generally don't. And maybe we can help fix that with code?
>
> Would love to hear further thoughts.
>
> Regards,
>
> Phillip
>
> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong wrote:
>
>> Hey Phillip,
>>
>> Thanks for raising this. I like the idea. The question is, should this be implemented in Spark or some other framework? I know that dbt has a fairly extensive way of testing your data <https://www.getdbt.com/product/data-testing/>, and making sure that you can enforce assumptions on the columns. The nice thing about dbt is that it is built from a software engineering perspective, so all the tests (or contracts) are living in version control. Using pull requests you could collaborate on changing the contract and making sure that the change has gotten enough attention before pushing it to production. Hope this helps!
>>
>> Kind regards,
>> Fokko
>>
>> Op di 13 jun 2023 om 04:31 schreef Deepak Sharma:
>>
>>> Spark can be used with tools like Great Expectations as well to implement the data contracts.
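The source-to-source mapping validation Mich describes above (customer_id in one system versus customer_no in another) can be sketched as a plain ETL-style check. This is illustrative Python only, not Spark API; the field names and mapping table are hypothetical:

```python
# Sketch: verify that every customer_id in source A resolves, via a mapping
# table, to a customer_no that actually exists in source B. In a real Spark
# job this would be a join against the mapping tables Mich mentions.

source_a = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
mapping = {1: "C-001", 2: "C-002"}          # customer_id -> customer_no (hypothetical)
source_b_nos = {"C-001", "C-002", "C-999"}  # customer_no values present in source B

def unmapped_ids(rows, mapping, target_keys):
    """Return the customer_ids that cannot be resolved to a customer_no in B."""
    bad = []
    for row in rows:
        cid = row["customer_id"]
        if cid not in mapping or mapping[cid] not in target_keys:
            bad.append(cid)
    return bad

print(unmapped_ids(source_a, mapping, source_b_nos))  # [3]
```

Rows that fail this lookup are exactly the legacy-system anomalies that otherwise need MDM or human intervention to reconcile.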
Re: Data Contracts
Hi, Fokko and Deepak.

The problem with DBT and Great Expectations (and Soda too, I believe) is that by the time they find the problem, the error is already in production - and fixing production can be a nightmare.

What's more, we've found that nobody ever looks at the data quality reports we already generate.

You can, of course, run DBT, GE etc as part of a CI/CD pipeline, but it's usually against synthetic or, at best, sampled data (laws like GDPR generally stop personal information data being anywhere but prod).

What I'm proposing is something that stops production data ever being tainted.

Hi, Elliot.

Nice to see you again (we worked together 20 years ago)!

The problem here is that a schema itself won't protect me (at least as I understand your argument). For instance, I have medical records that say some of my patients are 999 years old, which is clearly ridiculous, but their age correctly conforms to an integer data type. I have other patients who were discharged *before* they were admitted to hospital. I have 28 patients out of literally millions who recently attended hospital but were discharged on 1/1/1900. As you can imagine, this made the average length of stay (a key metric for acute hospitals) much lower than it should have been. It only came to light when some average lengths of stay were negative!

In all these cases, the data faithfully adhered to the schema.

Hi, Ryan.

This is an interesting point. There *should* indeed be a human connection, but often there isn't. For instance, I have a friend who complained that his company's Zurich office made a breaking change and was not even aware that his London-based department existed, never mind depended on their data. In large organisations, this is pretty common.

TBH, my proposal doesn't address this particular use case (maybe hooks and metastore listeners would...?) But my point remains that although these relationships should exist, in a sufficiently large organisation, they generally don't. And maybe we can help fix that with code?

Would love to hear further thoughts.

Regards,

Phillip

On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong wrote:

> Hey Phillip,
>
> Thanks for raising this. I like the idea. The question is, should this be implemented in Spark or some other framework? I know that dbt has a fairly extensive way of testing your data <https://www.getdbt.com/product/data-testing/>, and making sure that you can enforce assumptions on the columns. The nice thing about dbt is that it is built from a software engineering perspective, so all the tests (or contracts) are living in version control. Using pull requests you could collaborate on changing the contract and making sure that the change has gotten enough attention before pushing it to production. Hope this helps!
>
> Kind regards,
> Fokko
>
> Op di 13 jun 2023 om 04:31 schreef Deepak Sharma:
>
>> Spark can be used with tools like Great Expectations as well to implement the data contracts. I am not sure, though, if Spark alone can do the data contracts. I was reading a blog on data mesh and how to glue it together with data contracts; that's where I came across this Spark and Great Expectations mention.
>>
>> HTH
>>
>> -Deepak
>>
>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West wrote:
>>
>>> Hi Phillip,
>>>
>>> While not as fine-grained as your example, there do exist schema systems such as that in Avro that can evaluate compatible and incompatible changes to the schema, from the perspective of the reader, writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract. Interestingly, I believe this approach has been applied to both JsonSchema and protobuf as part of the Confluent Schema Registry.
>>>
>>> Elliot.
>>>
>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry wrote:
>>>
>>>> Hi, folks.
>>>>
>>>> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But instead, could big data tools be used to enforce these contracts?
>>>>
>>>> My questions really are: are there any plans to implement data constraints in Spark (eg, an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>>>>
>>>> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?
>>>>
>>>> Just throwing it out there and wondering what other people think. It's an area that interests me as it seems that over half my problems at the day job are because of dodgy data.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
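Phillip's point - values that satisfy the schema's types yet are still nonsense - comes down to row-level constraints. A minimal sketch in plain Python (the check functions and thresholds are illustrative, not Spark API):

```python
from datetime import date

# Row-level checks of the kind the schema cannot express: an age of 999
# type-checks as an integer, and a 1/1/1900 discharge date is a valid date,
# but both violate plausibility constraints.

def valid_age(age):
    return 0 <= age <= 120  # illustrative bound; 999 fails here

def valid_stay(admitted, discharged):
    return discharged >= admitted  # discharge must not precede admission

rows = [
    {"age": 42,  "admitted": date(2023, 1, 3), "discharged": date(2023, 1, 9)},
    {"age": 999, "admitted": date(2023, 1, 3), "discharged": date(2023, 1, 9)},
    {"age": 57,  "admitted": date(2023, 1, 9), "discharged": date(1900, 1, 1)},
]

violations = [i for i, r in enumerate(rows)
              if not (valid_age(r["age"])
                      and valid_stay(r["admitted"], r["discharged"]))]
print(violations)  # [1, 2]
```

Both bad rows here "faithfully adhere to the schema", which is exactly why metrics like average length of stay silently go wrong without such checks.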
Re: Data Contracts
Hey Phillip,

Thanks for raising this. I like the idea. The question is, should this be implemented in Spark or some other framework? I know that dbt has a fairly extensive way of testing your data <https://www.getdbt.com/product/data-testing/>, and making sure that you can enforce assumptions on the columns. The nice thing about dbt is that it is built from a software engineering perspective, so all the tests (or contracts) are living in version control. Using pull requests you could collaborate on changing the contract and making sure that the change has gotten enough attention before pushing it to production. Hope this helps!

Kind regards,
Fokko

Op di 13 jun 2023 om 04:31 schreef Deepak Sharma:

> Spark can be used with tools like Great Expectations as well to implement the data contracts. I am not sure, though, if Spark alone can do the data contracts. I was reading a blog on data mesh and how to glue it together with data contracts; that's where I came across this Spark and Great Expectations mention.
>
> HTH
>
> -Deepak
>
> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West wrote:
>
>> Hi Phillip,
>>
>> While not as fine-grained as your example, there do exist schema systems such as that in Avro that can evaluate compatible and incompatible changes to the schema, from the perspective of the reader, writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract. Interestingly, I believe this approach has been applied to both JsonSchema and protobuf as part of the Confluent Schema Registry.
>>
>> Elliot.
>>
>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry wrote:
>>
>>> Hi, folks.
>>>
>>> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But instead, could big data tools be used to enforce these contracts?
>>>
>>> My questions really are: are there any plans to implement data constraints in Spark (eg, an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>>>
>>> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?
>>>
>>> Just throwing it out there and wondering what other people think. It's an area that interests me as it seems that over half my problems at the day job are because of dodgy data.
>>>
>>> Regards,
>>>
>>> Phillip
Re: Data Contracts
Spark can be used with tools like Great Expectations as well to implement the data contracts. I am not sure, though, if Spark alone can do the data contracts. I was reading a blog on data mesh and how to glue it together with data contracts; that's where I came across this Spark and Great Expectations mention.

HTH

-Deepak

On Tue, 13 Jun 2023 at 12:48 AM, Elliot West wrote:

> Hi Phillip,
>
> While not as fine-grained as your example, there do exist schema systems such as that in Avro that can evaluate compatible and incompatible changes to the schema, from the perspective of the reader, writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract. Interestingly, I believe this approach has been applied to both JsonSchema and protobuf as part of the Confluent Schema Registry.
>
> Elliot.
>
> On Mon, 12 Jun 2023 at 12:43, Phillip Henry wrote:
>
>> Hi, folks.
>>
>> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But instead, could big data tools be used to enforce these contracts?
>>
>> My questions really are: are there any plans to implement data constraints in Spark (eg, an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>>
>> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?
>>
>> Just throwing it out there and wondering what other people think. It's an area that interests me as it seems that over half my problems at the day job are because of dodgy data.
>>
>> Regards,
>>
>> Phillip
Re: Data Contracts
Hi Phillip,

While not as fine-grained as your example, there do exist schema systems such as that in Avro that can evaluate compatible and incompatible changes to the schema, from the perspective of the reader, writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract. Interestingly, I believe this approach has been applied to both JsonSchema and protobuf as part of the Confluent Schema Registry.

Elliot.

On Mon, 12 Jun 2023 at 12:43, Phillip Henry wrote:

> Hi, folks.
>
> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But instead, could big data tools be used to enforce these contracts?
>
> My questions really are: are there any plans to implement data constraints in Spark (eg, an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>
> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an area that interests me as it seems that over half my problems at the day job are because of dodgy data.
>
> Regards,
>
> Phillip
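The reader/writer compatibility idea Elliot mentions can be illustrated with a toy check. This is a sketch of the general principle only - Avro's actual schema-resolution rules are considerably richer - and the schema encoding here is invented for illustration:

```python
# Toy reader/writer compatibility check: a reader can consume a writer's
# data if every field the reader requires is either present in the writer's
# schema or has a default. Not Avro's real algorithm; just the shape of it.

reader = {"id": {"required": True}, "name": {"required": True}}  # downstream expectation
writer_v2 = {"id": {"required": True}}                           # producer dropped "name"

def reader_compatible(reader_schema, writer_schema):
    for field, spec in reader_schema.items():
        if spec.get("required") and field not in writer_schema \
                and "default" not in spec:
            return False
    return True

print(reader_compatible(reader, writer_v2))  # False: readers expecting "name" break
```

A schema registry runs a check of this shape at publish time, which is how an incompatible change is rejected before any consumer ever sees bad data.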
Re: Data Contracts
Hey Phillip,

You're right that we can improve tooling to help with data contracts, but I think that a contract still needs to be an agreement between people. Constraints help by ensuring a data producer adheres to the contract and by giving feedback as soon as possible when assumptions are violated.

The problem with treating the constraints as the only contract is that it's too easy to change them. For example, if I change a required column to a nullable column, that's a perfectly valid transition, but only if I've communicated that change to downstream consumers.

Ryan

On Mon, Jun 12, 2023 at 4:43 AM Phillip Henry wrote:

> Hi, folks.
>
> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But instead, could big data tools be used to enforce these contracts?
>
> My questions really are: are there any plans to implement data constraints in Spark (eg, an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>
> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an area that interests me as it seems that over half my problems at the day job are because of dodgy data.
>
> Regards,
>
> Phillip

--
Ryan Blue
Tabular
Data Contracts
Hi, folks.

There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But instead, could big data tools be used to enforce these contracts?

My questions really are: are there any plans to implement data constraints in Spark (eg, an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?

Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?

Just throwing it out there and wondering what other people think. It's an area that interests me as it seems that over half my problems at the day job are because of dodgy data.

Regards,

Phillip
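Phillip's proposal - constraints carried in schema metadata and enforced at write time - can be sketched in miniature. The schema encoding and the `enforce_write` helper below are hypothetical, not Spark's `StructField` metadata or `FileFormatDataWriter` API; they only show the shape of the idea:

```python
# Sketch: constraints attached to field metadata, checked before the write
# is allowed to proceed. In Spark this enforcement would live inside the
# writer; here a plain function stands in for it.

schema = [
    {"name": "score", "type": "int",
     "metadata": {"constraint": lambda v: 0 <= v <= 100}},  # the 0-100 example
    {"name": "x", "type": "date", "metadata": {}},           # unconstrained
]

def enforce_write(rows, schema):
    """Raise instead of persisting if any row violates a field's constraint."""
    for row in rows:
        for field in schema:
            check = field["metadata"].get("constraint")
            if check and not check(row[field["name"]]):
                raise ValueError(f"contract violated on {field['name']!r}")
    return len(rows)  # stand-in for the actual write

print(enforce_write([{"score": 55, "x": None}], schema))  # 1 (write proceeds)
try:
    enforce_write([{"score": 101, "x": None}], schema)
except ValueError as e:
    print(e)  # contract violated on 'score'
```

Failing loudly at write time, rather than reporting afterwards, is the key difference from the DBT/Great Expectations style of checking: tainted data never reaches the table.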