[jira] [Created] (GRIFFIN-358) Rewrite the Rule/Measure implementations

Chitral Verma (Jira) Thu, 18 Mar 2021 09:15:13 -0700

Chitral Verma created GRIFFIN-358:
-------------------------------------

             Summary: Rewrite the Rule/Measure implementations
                 Key: GRIFFIN-358
                 URL: https://issues.apache.org/jira/browse/GRIFFIN-358
             Project: Griffin
          Issue Type: New Feature
            Reporter: Chitral Verma
            Assignee: Chitral Verma



Currently, `RuleParams` can be one of the following 3 DSL types:
 * Data Ops (for source preprocessing)
 * Griffin DSL
 * SparkSQL

Griffin DSL allows the implementation of measures (DQ types) such as Completeness, 
Accuracy, etc.

To enable such measures, there is an extensive implementation of expressions, 
task hierarchies, and parsing, most of which depends heavily on 
scala-parser-combinators.

In the end, Griffin DSL tries to mimic a SparkSQL-like 
query by substituting user-defined constraints.
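
To make the substitution idea concrete, here is a minimal sketch in plain Python. It is not Griffin's actual translation logic: the function name `completeness_query`, the table name `src`, and the column names are all hypothetical, and the stdlib `sqlite3` module stands in for SparkSQL only so the generated query can be executed locally.

```python
import sqlite3
from typing import Sequence, Tuple


def completeness_query(table: str, columns: Sequence[str]) -> str:
    """Build a SQL query that counts rows where any given column is NULL.

    Hypothetical helper: the user-defined constraint (the column list) is
    substituted into a fixed SQL template.
    """
    null_check = " OR ".join(f'"{c}" IS NULL' for c in columns)
    return (
        "SELECT COUNT(*) AS total, "
        f"SUM(CASE WHEN {null_check} THEN 1 ELSE 0 END) AS incomplete "
        f'FROM "{table}"'
    )


def run_demo() -> Tuple[int, int]:
    """Run the generated query on a tiny in-memory table (assumed sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [(1, "a"), (2, None), (None, "c"), (4, "d")],
    )
    total, incomplete = conn.execute(
        completeness_query("src", ["id", "name"])
    ).fetchone()
    conn.close()
    return total, incomplete
```

Calling `run_demo()` returns the total row count and the number of rows with at least one NULL in the listed columns.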

This approach has some drawbacks:
 * Suboptimal processing. While the transformation steps execute in parallel on 
the driver, the data set is still scanned multiple times in parallel, which can 
cause inefficiencies on the SparkSession side, and the internal task scheduler 
was single-threaded. Even though the data set can be cached, it is still 
branched, and crucial memory is spent holding the data set rather than 
processing it.
 * Internal functions of Spark are not used. Data preprocessing currently has a 
very limited scope, even though hundreds of Spark SQL functions are available 
for use.
 * This blocks structured streaming. The manually constructed SQL queries cause 
multiple aggregations in the same query on a streaming data set, which is not 
supported by Spark Structured Streaming. There are workarounds for this, but 
they all require rewriting the *Expr2DQSteps classes.
 * Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure 
and SparkSQL are redundant functionalities.
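
The cost of repeated scans in the first drawback can be illustrated with a small sketch. This is plain Python over an in-memory list, not Spark or Griffin code: the function names `multi_pass` and `single_pass` and the sample rows are hypothetical. It contrasts one scan per measure with a single scan that computes all measures together.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[object]]


def multi_pass(rows: List[Row], columns: Sequence[str]) -> Tuple[Dict[str, float], int]:
    """Compute per-column completeness with one full scan per column."""
    scans = 0
    result: Dict[str, float] = {}
    for col in columns:
        scans += 1  # each measure triggers its own pass over the data
        non_null = sum(1 for row in rows if row.get(col) is not None)
        result[col] = non_null / len(rows) if rows else 0.0
    return result, scans


def single_pass(rows: List[Row], columns: Sequence[str]) -> Tuple[Dict[str, float], int]:
    """Compute the same measures while reading every row only once."""
    non_null = {col: 0 for col in columns}
    total = 0
    for row in rows:  # the only pass over the data
        total += 1
        for col in columns:
            if row.get(col) is not None:
                non_null[col] += 1
    result = {col: (non_null[col] / total if total else 0.0) for col in columns}
    return result, 1


# Assumed sample data, for illustration only.
sample_rows: List[Row] = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": None, "name": "c"},
    {"id": 4, "name": "d"},
]
```

Both functions return the same completeness values for `sample_rows`; only the number of scans differs, and that difference grows with the number of measures.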



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
