[jira] [Work logged] (GRIFFIN-358) Rewrite the Rule/Measure implementations

Chitral Verma (Jira) Tue, 06 Jul 2021 04:48:07 -0700


     [ 
https://issues.apache.org/jira/browse/GRIFFIN-358?focusedWorklogId=619308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619308
 ]


Chitral Verma logged work on GRIFFIN-358:
-----------------------------------------

                Author: Chitral Verma
            Created on: 06/Jul/21 11:47
            Start Date: 06/Jul/21 11:47
    Worklog Time Spent: 504h 

Issue Time Tracking
-------------------

    Worklog Id:     (was: 619308)
    Time Spent: 505h 40m  (was: 1h 40m)

> Rewrite the Rule/Measure implementations
> ----------------------------------------
>
>                 Key: GRIFFIN-358
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-358
>             Project: Griffin
>          Issue Type: New Feature
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>          Time Spent: 505h 40m
>  Remaining Estimate: 0h
>
> Currently, `RuleParams` can be one of the following 3 DSL types:
>  * Data Ops (for source preprocessing)
>  * Griffin DSL
>  * SparkSQL
> Griffin DSL allows the implementation of measures (DQ types) like 
> Completeness, Accuracy, etc.
> To enable such measures there is an extensive implementation of expressions, 
> task hierarchies and parsing, most of which is heavily dependent on 
> scala-parser-combinators.
> In the end, Griffin DSL tries to mimic a SparkSQL-like 
> query with user-defined constraints substituted in.
> This approach has some drawbacks:
>  * Suboptimal processing. While the transformation steps execute in parallel 
> on the driver, the data set is still scanned multiple times in parallel, which 
> can cause inefficiencies on the SparkSession side, and the internal task 
> scheduler was single-threaded. Even though the data set can be cached, it is 
> still branched, and crucial memory is spent on holding the data set rather 
> than processing it.
>  * Internal functions of Spark are not used. Data preprocessing currently has 
> a very limited scope even though hundreds of Spark SQL functions are 
> available for use.
>  * This blocks Structured Streaming. The manually constructed SQL queries 
> cause multiple aggregations in the same query on a streaming data set, which 
> is not supported by Spark's Structured Streaming. There are workarounds for 
> this, but they all require rewriting the *Expr2DQSteps classes.
>  * Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure 
> and SparkSQL are redundant functionalities.
> The proposed solution involves SparkSQL DSL based measures and some changes 
> to Rule Params. This will enhance the data preprocessing flows and the 
> measures themselves.
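
The "scanned multiple times" drawback described above can be illustrated with a minimal sketch in plain Python. This is not Griffin or Spark code: `CountingSource`, `completeness_per_scan` and `completeness_single_pass` are hypothetical names introduced only for this illustration, and "completeness" is assumed here to mean the fraction of non-null values in a column.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, Optional[str]]


class CountingSource:
    """Hypothetical in-memory data set that records how often it is scanned."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.scans = 0

    def __iter__(self) -> Iterator[Row]:
        # Every new iteration over the source counts as one full scan.
        self.scans += 1
        return iter(self._rows)


def completeness_per_scan(source: CountingSource, columns: List[str]) -> Dict[str, float]:
    """One full scan per column, similar in spirit to one query per measure."""
    result: Dict[str, float] = {}
    for col in columns:
        total = 0
        non_null = 0
        for row in source:
            total += 1
            if row.get(col) is not None:
                non_null += 1
        result[col] = non_null / total if total else 0.0
    return result


def completeness_single_pass(source: CountingSource, columns: List[str]) -> Dict[str, float]:
    """All columns aggregated together in a single scan of the source."""
    total = 0
    non_null = {col: 0 for col in columns}
    for row in source:
        total += 1
        for col in columns:
            if row.get(col) is not None:
                non_null[col] += 1
    return {col: (non_null[col] / total if total else 0.0) for col in columns}


if __name__ == "__main__":
    rows = [
        {"id": "1", "name": "a"},
        {"id": "2", "name": None},
        {"id": None, "name": None},
        {"id": "4", "name": "d"},
    ]
    a, b = CountingSource(rows), CountingSource(rows)
    print(completeness_per_scan(a, ["id", "name"]), "scans:", a.scans)
    print(completeness_single_pass(b, ["id", "name"]), "scans:", b.scans)
```

Both functions return the same ratios, but the first reads the source once per column while the second reads it once in total. An engine that aggregates all measures in one query can, under the same assumption, avoid the repeated scans.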



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
