guoyuepeng commented on PR #654:
URL: https://github.com/apache/griffin/pull/654#issuecomment-2185750569

   > @guoyuepeng adding data integration to Griffin with a UI seems like a good idea. are we planning UI changes as well?
   > 
   > my suggestion is to,
   > 
   > * Use logstash/ Apache Spark/ DBT/ Polars / Apache Datafusion backends for data integration
   > * Griffin data quality rules for DQ
   >   
   >   * This can be pull or push based
   > * Scheduling by Apache Airflow
   > * and cataloguing/ lineage collection by [Datahub](https://github.com/datahub-project/datahub)
   > 
   > all of this via one UI.
   > 
   > If this makes sense, then I can create an architecture doc for this
   
   @chitralverma 
   Since this proposal changes a lot, the UI will definitely need an 
upgrade as well.
   
   By integration I don't mean data integration. I mean appending data 
quality checking pipelines behind the usual ETL pipelines.
   ```
   existing enterprise data ETL pipelines -> data quality checking pipelines (contributed by Apache Griffin)
   
   ETLJob  ---> ETLJob will become
   
   ETLJob  ---> ETLJob
       |                  ^
       |                  | (optional) 
       ---> DQJob  
   
   ```
   About scheduling: I think each data org already has its own scheduler 
for its daily data jobs, so it is really hard to ask them to set up and 
support a new one. I want to leverage the existing scheduler: we just need to 
generate a job plan and somehow insert our data quality jobs into the existing 
job DAG. 
   Of course, we can provide a default scheduler, but it should be 
replaceable.
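
As a rough illustration of the "generate a job plan" idea, here is a minimal sketch that takes an existing job DAG and appends a DQ job behind selected ETL jobs. Everything in it is an assumption for discussion only: the dict-of-upstreams DAG shape, the `attach_dq_jobs` name, and the `gate_downstream` flag (which models the optional arrow from DQJob to the next ETLJob in the diagram above) are hypothetical and not part of Griffin today.

```python
from graphlib import TopologicalSorter


def attach_dq_jobs(dag, dq_targets, gate_downstream=False):
    """Return a new DAG with a DQ job appended behind each targeted ETL job.

    dag:        maps each job name to the set of jobs it depends on.
    dq_targets: maps an ETL job name to the (hypothetical) DQ job to run after it.
    gate_downstream: if True, jobs that depended on the ETL job also wait
                     for its DQ job (the optional edge in the diagram).
    """
    # Copy so the scheduler's original plan is never mutated.
    new_dag = {job: set(deps) for job, deps in dag.items()}
    for etl_job, dq_job in dq_targets.items():
        if etl_job not in new_dag:
            raise KeyError(f"unknown ETL job: {etl_job}")
        # The DQ job always runs after the ETL job it checks.
        new_dag[dq_job] = {etl_job}
        if gate_downstream:
            for job, deps in new_dag.items():
                if job != dq_job and etl_job in deps:
                    deps.add(dq_job)
    return new_dag


# Hypothetical three-step pipeline owned by an existing scheduler.
existing = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

side_by_side = attach_dq_jobs(existing, {"transform": "dq_transform"})
gated = attach_dq_jobs(existing, {"transform": "dq_transform"}, gate_downstream=True)

# Any scheduler that accepts a dependency graph could consume the result;
# here the standard library is used only to show a valid execution order.
gated_order = list(TopologicalSorter(gated).static_order())
```

The function only returns a plan, so the same output could in principle be translated into whatever DAG format the org's scheduler already uses, which is what keeps the default scheduler replaceable.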
   
   Data lineage is very important for troubleshooting. When a data quality 
issue has been identified and needs fixing, we have to walk the lineage back 
to the code that introduced it and fix it there, whether that code is SQL or 
Python.
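
To make the lineage walk concrete, here is a small sketch of tracing a failing dataset back through its upstream producers. The lineage structure, the dataset names, and the `upstream_code` helper are all made up for illustration; in practice this information would come from whatever catalog or lineage store is chosen, and its real API would look different.

```python
from collections import deque


def upstream_code(lineage, dataset):
    """Walk lineage breadth-first from a failing dataset to its upstream code.

    lineage maps a dataset name to {"produced_by": <code ref>, "inputs": [...]}.
    Returns (dataset, code ref) pairs, nearest producer first.
    """
    ordered = []
    visited = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        node = lineage.get(current)
        if node is None:
            continue  # external source: no producing code recorded
        ordered.append((current, node["produced_by"]))
        for parent in node["inputs"]:
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return ordered


# Hypothetical lineage: a SQL report built on a Python cleaning job.
lineage = {
    "report": {"produced_by": "report.sql", "inputs": ["orders_clean", "users"]},
    "orders_clean": {"produced_by": "clean_orders.py", "inputs": ["orders_raw"]},
}

suspects = upstream_code(lineage, "report")
```

Here `suspects` lists the SQL and Python files to inspect, starting with the job closest to the dataset where the DQ check failed.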
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

