[ https://issues.apache.org/jira/browse/GRIFFIN-358?focusedWorklogId=619308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619308 ]
Chitral Verma logged work on GRIFFIN-358:
-----------------------------------------

                Author: Chitral Verma
            Created on: 06/Jul/21 11:47
            Start Date: 06/Jul/21 11:47
    Worklog Time Spent: 504h

Issue Time Tracking
-------------------

            Worklog Id:     (was: 619308)
            Time Spent: 505h 40m  (was: 1h 40m)

> Rewrite the Rule/Measure implementations
> ----------------------------------------
>
>                 Key: GRIFFIN-358
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-358
>             Project: Griffin
>          Issue Type: New Feature
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>          Time Spent: 505h 40m
>  Remaining Estimate: 0h
>
> Current `RuleParams` can be of the following 3 DSL types,
> * Data Ops (for source preprocessing)
> * Griffin DSL
> * SparkSQL
> Griffin DSL allows the implementation of measures (DQ Types) like
> Completeness, Accuracy, etc.
> To enable such measures there is an extensive implementation of expression
> and task hierarchies and parsing, most of which is heavily dependent on
> scala-parser-combinators.
> At the end of the implementation, Griffin DSL tries to mimic a SparkSQL-like
> query by substitution of user-defined constraints.
> This approach has some drawbacks,
> * Suboptimal processing. While the transformation steps execute in parallel
> on the driver, the data set is still scanned multiple times in parallel,
> which can cause inefficiencies on the SparkSession side, and the internal
> task scheduler was single-threaded. Even though the data set can be cached,
> it is still branched, and crucial memory is spent holding the data set
> rather than processing it.
> * Internal functions of Spark are not used. Data preprocessing has a very
> limited scope currently, even though hundreds of Spark SQL functions are
> available for use.
> * This blocks structured streaming. The manually constructed SQL queries
> cause multiple aggregations in the same query on a streaming data set, which
> is not supported by Spark's Structured Streaming. There are workarounds for
> this, but they all require rewriting the *Expr2DQSteps classes.
> * Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure
> and SparkSQL are redundant functionalities.
> The proposed solution involves SparkSQL DSL based measures and some changes
> to Rule Params. This will enhance the data preprocessing flows and the
> measures themselves.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)