Both Drools and JavaScript are already part of Hop's scripting capabilities, so they can be used right away to build data quality rules and to identify and track such problems. As for profiling, I agree with Matt's vision: something else is needed to address that problem properly.
Sergio

Sergio Ramazzina
Senior Solution Architect
T: +39 02 92979810
M: +39 3472103689
[email protected] - www.serasoft.it
via Milano, 78 - 20013 Magenta (MI)

From: Thad Guidry <[email protected]>
Sent: Wednesday, 15 June 2022 18:55
To: [email protected]
Subject: Re: hop/Data Profiling

Actually, they could use Drools (and in theory, JavaScript or Groovy rule engines), right? Even if not directly supported, they could be added in some of the Scripting modules.

Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/

On Wed, Jun 15, 2022 at 10:27 AM Matt Casters <[email protected]> wrote:

Great topic!

So to add to that, data profiling can be done right now using the common statistical features, but it's geared towards operational data profiling. Everything you can aggregate (min, max, count all, count non-null, ...) or checksum can be kept track of, and I would consider it "best practice" to do this in scenarios where data is being staged before processing, for example. That way you can set alerts on the profiling data (counts) to see that not too many records are being rejected. Another example would be to put alerts on certain fields in data sets to see that they're not over 80% null (again, as an example). I kept this sort of "operational data profiling" in mind when I was architecting the new monitoring and logging functionality for the post-2.0 versions.

As far as the user interfaces are concerned for "online data profiling", usually used to profile input data sources and so on, I'm going to join Bart in inviting the community to submit requirements. I'm convinced there's a lot we can do with little effort, but I still think it's always better to start from those fresh requirements.

Thanks in advance!
Matt

On Wed, Jun 15, 2022 at 4:58 PM Bart Maertens <[email protected]> wrote:

Hi Kevin,

There are no dedicated data profiling/quality transforms in Hop (yet), but everything in Hop can be used to build data profiling/quality checks. You can build your own data quality checks and profiling in a Hop project or framework. We'll probably do more on both quality and profiling in future releases, but that functionality is not available yet. Feel free to create an improvement ticket in JIRA so we can keep track of it.

Regards,
Bart

On Wed, Jun 15, 2022 at 4:48 PM Kevin L Kitts <[email protected]> wrote:

Hi All,

I saw in the documentation “Getting Started” section a reference to “Data Profiling”. I’d like to find more information on how data profiling and data quality related tasks are accomplished in Hop. Is there a section of the documentation that describes the data profiling/data quality features of Hop?

Thanks!

--
Neo4j Chief Solutions Architect
✉ [email protected]
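
As a rough illustration of the "operational data profiling" Matt describes in the thread (per-field aggregates plus an alert when a field is more than 80% null), here is a minimal Python sketch. It is not Hop code and does not use any Hop API. The function names `profile_field` and `null_alerts`, the sample rows, and the field names are all hypothetical and exist only for illustration. In Hop itself the same idea would presumably be built from existing transforms in a pipeline.

```python
from typing import Any, Dict, Iterable, List, Optional


def profile_field(rows: Iterable[Dict[str, Any]], field: str) -> Dict[str, Any]:
    """Collect simple aggregates for one field: counts, min, max, null ratio.

    Hypothetical helper for illustration; not part of Hop.
    """
    count_all = 0
    count_non_null = 0
    minimum: Optional[Any] = None
    maximum: Optional[Any] = None
    for row in rows:
        count_all += 1
        value = row.get(field)
        if value is None:
            continue  # nulls only contribute to count_all
        count_non_null += 1
        if minimum is None or value < minimum:
            minimum = value
        if maximum is None or value > maximum:
            maximum = value
    null_ratio = (count_all - count_non_null) / count_all if count_all else 0.0
    return {
        "count_all": count_all,
        "count_non_null": count_non_null,
        "min": minimum,
        "max": maximum,
        "null_ratio": null_ratio,
    }


def null_alerts(profiles: Dict[str, Dict[str, Any]],
                max_null_ratio: float = 0.8) -> List[str]:
    """Return the fields whose null ratio is above the threshold (80% by default)."""
    return [name for name, p in profiles.items() if p["null_ratio"] > max_null_ratio]


# Made-up staged data: 6 rows, 5 of them with a null email.
staged_rows = [
    {"customer_id": 1, "email": None},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": None},
    {"customer_id": 4, "email": None},
    {"customer_id": 5, "email": None},
    {"customer_id": 6, "email": "a@example.com"},
]

profiles = {f: profile_field(staged_rows, f) for f in ("customer_id", "email")}
alerts = null_alerts(profiles)
print(alerts)
```

Storing the resulting profile per run (for example alongside the staging load) is what would make it possible to alert on trends such as a rising count of rejected records, as suggested in the thread. The 0.8 threshold is simply the example figure from Matt's message.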
