Chitral Verma created GRIFFIN-358:
-------------------------------------
Summary: Rewrite the Rule/Measure implementations
Key: GRIFFIN-358
URL: https://issues.apache.org/jira/browse/GRIFFIN-358
Project: Griffin
Issue Type: New Feature
Reporter: Chitral Verma
Assignee: Chitral Verma
Currently, `RuleParams` can be one of the following 3 DSL types:
* Data Ops (for source preprocessing)
* Griffin DSL
* SparkSQL
Griffin DSL allows the implementation of measures (DQ types) such as Completeness,
Accuracy, etc.
To enable such measures, there is an extensive implementation of expression and
task hierarchies and of parsing, most of which depends heavily on
scala-parser-combinators.
At the end of this pipeline, Griffin DSL tries to mimic a SparkSQL-like
query by substituting in the user-defined constraints.
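
The substitution idea described above can be sketched as follows. This is
only an illustration in plain Python, not Griffin's actual generated SQL:
`completeness_sql` and the `source` table are hypothetical names, and an
in-memory SQLite database stands in for SparkSQL so the example runs without
a cluster.

```python
import sqlite3


def completeness_sql(table, columns):
    """Build a completeness query by substituting user-supplied column names.

    Hypothetical helper: it only illustrates the "template plus substitution"
    style of query construction.
    """
    # A row counts as incomplete if any of the listed columns is NULL.
    null_check = " OR ".join(f"{c} IS NULL" for c in columns)
    return (
        "SELECT COUNT(*) AS total, "
        f"SUM(CASE WHEN {null_check} THEN 1 ELSE 0 END) AS incomplete "
        f"FROM {table}"
    )


def run_demo():
    # SQLite is used here purely as a stand-in SQL engine (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (id INTEGER, email TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO source VALUES (?, ?, ?)",
        [
            (1, "a@x.org", "111"),
            (2, None, "222"),
            (3, "c@x.org", None),
            (4, "d@x.org", "444"),
        ],
    )
    query = completeness_sql("source", ["email", "phone"])
    total, incomplete = conn.execute(query).fetchone()
    conn.close()
    return query, total, incomplete


if __name__ == "__main__":
    query, total, incomplete = run_demo()
    print(query)
    print(f"total={total} incomplete={incomplete}")
```

Because the query is assembled as a string, anything the template does not
anticipate (extra functions, different aggregations) needs new parsing and
translation code, which is the maintenance cost this issue describes.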
This approach has some drawbacks:
* Suboptimal processing. While the transformation steps execute in parallel on
the driver, the data set is still scanned multiple times in parallel, which can
cause inefficiencies on the SparkSession side, and the internal task scheduler
was single-threaded. Even though the data set can be cached, it is still
branched, and crucial memory is spent holding the data set rather than
processing it.
* Internal functions of Spark are not used. Data preprocessing currently has a
very limited scope, even though hundreds of Spark SQL functions are available
for use.
* This blocks Structured Streaming. The manually constructed SQL queries perform
multiple aggregations in the same query on a streaming data set, which is not
supported by Spark's Structured Streaming. There are workarounds for this, but
they all require rewriting the *Expr2DQSteps classes.
* Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure
and SparkSQL are redundant functionalities.
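
To make the first drawback concrete, here is a minimal sketch of the
difference between scanning a data set once per metric and computing all
metrics in one pass. It is plain Python rather than Spark or Griffin code;
`CountingSource`, `completeness_naive` and `completeness_single_pass` are
hypothetical names introduced only for this illustration.

```python
class CountingSource:
    """In-memory stand-in for a data set that records how often it is scanned."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def __iter__(self):
        # Every new iteration over the source counts as one full scan.
        self.scans += 1
        return iter(self._rows)


def completeness_naive(source, columns):
    """One scan for the total, plus one additional scan per column."""
    total = sum(1 for _ in source)
    result = {}
    for c in columns:
        incomplete = sum(1 for row in source if row.get(c) is None)
        result[c] = (total, incomplete)
    return result


def completeness_single_pass(source, columns):
    """All counters are updated during a single scan of the source."""
    total = 0
    incomplete = {c: 0 for c in columns}
    for row in source:
        total += 1
        for c in columns:
            if row.get(c) is None:
                incomplete[c] += 1
    return {c: (total, incomplete[c]) for c in columns}


ROWS = [
    {"id": 1, "email": "a@x.org", "phone": "111"},
    {"id": 2, "email": None, "phone": "222"},
    {"id": 3, "email": "c@x.org", "phone": None},
    {"id": 4, "email": "d@x.org", "phone": "444"},
]

if __name__ == "__main__":
    naive_src = CountingSource(ROWS)
    single_src = CountingSource(ROWS)
    print(completeness_naive(naive_src, ["email", "phone"]), naive_src.scans)
    print(completeness_single_pass(single_src, ["email", "phone"]), single_src.scans)
```

Both functions return the same (total, incomplete) pair per column; only the
number of scans differs. In Spark, the analogous single-pass form would be one
aggregation over the DataFrame built from native column expressions, rather
than several generated queries over the same input.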
--
This message was sent by Atlassian Jira
(v8.3.4#803005)