R: hop/Data Profiling

Sergio Ramazzina Wed, 15 Jun 2022 10:05:08 -0700

Both Drools and JavaScript are already part of Hop's scripting capabilities and 
can be used right away to build data quality rules and to identify and track 
data quality problems.
As for profiling, I agree with Matt's vision: something else is needed to 
properly address the problem.


Sergio


Sergio Ramazzina
Senior Solution Architect
T: +39 02 92979810
M: +39 3472103689
[email protected] - www.serasoft.it
via Milano, 78 - 20013 Magenta (MI)

From: Thad Guidry <[email protected]>
Sent: Wednesday, 15 June 2022 18:55
To: [email protected]
Subject: Re: hop/Data Profiling

Actually, they could use Drools (and, in theory, JavaScript or Groovy rule 
engines), right? Even if these are not directly supported, they could be added 
in some of the Scripting modules.
Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/


On Wed, Jun 15, 2022 at 10:27 AM Matt Casters <[email protected]> wrote:
Great topic!

So to add to that, data profiling can be done right now using the common 
statistical features, but it's geared towards operational data profiling. 
Everything you can aggregate (min, max, count all, count non-null, ...) or 
checksum can be kept track of, and I would consider it best practice to do 
this in scenarios where data is being staged before processing, for example. 
That way you can set alerts on the profiling data (counts) to see that not too 
many records are being rejected. Another example would be to put alerts on 
certain fields in data sets to see that they're not over 80% null (again, as an 
example).
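
To make the idea above concrete, here is a minimal sketch in plain Python. It is 
not Hop code and does not use any Hop API; the names FieldProfile, profile_rows 
and null_alerts are hypothetical, and rows are assumed to be dictionaries with 
None standing in for null. It only illustrates keeping running aggregates per 
field (count all, count non-null, min, max) and flagging fields whose null 
ratio exceeds a threshold such as the 80% mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class FieldProfile:
    """Running aggregates for one field (hypothetical helper, not a Hop class)."""
    count_all: int = 0
    count_non_null: int = 0
    minimum: Optional[Any] = None
    maximum: Optional[Any] = None

    def update(self, value: Any) -> None:
        # Every value counts towards the total; None is treated as null.
        self.count_all += 1
        if value is None:
            return
        self.count_non_null += 1
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    @property
    def null_ratio(self) -> float:
        # Fraction of null values seen so far; 0.0 when nothing was profiled.
        if self.count_all == 0:
            return 0.0
        return (self.count_all - self.count_non_null) / self.count_all


def profile_rows(rows: Iterable[Dict[str, Any]],
                 fields: List[str]) -> Dict[str, FieldProfile]:
    """Build one FieldProfile per requested field in a single pass."""
    profiles = {name: FieldProfile() for name in fields}
    for row in rows:
        for name in fields:
            # A missing key is treated the same as an explicit null.
            profiles[name].update(row.get(name))
    return profiles


def null_alerts(profiles: Dict[str, FieldProfile],
                max_null_ratio: float = 0.8) -> List[str]:
    """Return the names of fields whose null ratio is above the threshold."""
    return [name for name, profile in profiles.items()
            if profile.null_ratio > max_null_ratio]
```

In a staging scenario one would call profile_rows on the staged records, store 
the resulting counts alongside the load, and raise an alert whenever 
null_alerts returns a non-empty list. The threshold and the set of fields to 
watch are assumptions to tune per data set.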

I kept this sort of "operational data profiling" in mind when I was 
architecting the new monitoring and logging functionality for the post 2.0 
versions.

As far as the user interfaces are concerned for "online data profiling", 
usually used to profile input data sources and so on, I'm going to join Bart in 
inviting the community to submit requirements.  I'm convinced there's a lot we 
can do with little effort but I still think it's always better to start from 
those fresh requirements.

Thanks in advance!
Matt

On Wed, Jun 15, 2022 at 4:58 PM Bart Maertens <[email protected]> wrote:
Hi Kevin,

There are no dedicated data profiling/quality transforms in Hop (yet), but at 
the same time everything in Hop can be used to build data profiling/quality 
checks. You can build your own data quality checks and profiling in a Hop 
project or framework. We'll probably do more on both quality and profiling in 
future releases, but that functionality is not available yet.
Feel free to create an improvement ticket in JIRA so we can keep track of it.

Regards,
Bart



On Wed, Jun 15, 2022 at 4:48 PM Kevin L Kitts <[email protected]> wrote:
Hi All,

In the “Getting Started” section of the documentation I saw a reference to 
“Data Profiling”. I’d like to find more information on how data profiling and 
data quality related tasks are accomplished in Hop. Is there a section of the 
documentation that describes Hop's data profiling/data quality features?

Thanks!



--
Neo4j Chief Solutions Architect
✉   [email protected]
