guoyuepeng commented on PR #654:
URL: https://github.com/apache/griffin/pull/654#issuecomment-2185750569

> @guoyuepeng adding data integration to Griffin with a UI seems like a good idea. are we planning UI changes as well?
>
> my suggestion is to,
>
> * Use logstash/ Apache Spark/ DBT/ Polars / Apache Datafusion backends for data integration
> * Griffin data quality rules for DQ
>
> * This can be pull or push based
> * Scheduling by Apache Airflow
> * and cataloguing/ lineage collection by [Datahub](https://github.com/datahub-project/datahub)
>
> all of this via one UI.
>
> If this makes sense, then I can create an architecture doc for this

@chitralverma Since we will change a lot based on this proposal, the UI will definitely need an upgrade as well.

When I say integration, I don't mean data integration. I mean appending data quality checking pipelines after the usual ETL pipelines.

```
existing enterprise data ETL pipelines -> data quality checking pipelines (contributed by Apache Griffin)

ETLJob ---> ETLJob

will become

ETLJob ---> ETLJob
  |           ^
  |           | (optional)
  ---> DQJob
```

About scheduling: I think each data org already has its own scheduler running its daily data jobs, so it is really hard to ask them to set up and support a new one. I want to leverage the existing scheduler. We just need to generate a job plan and somehow insert our data quality jobs into the existing job DAG. Of course, we can provide a default scheduler, but that scheduler should be replaceable.

Data lineage is very important for troubleshooting. Once a data quality issue has been identified and needs to be fixed, we need to walk the lineage back to the originating code and fix it there, whether that code is SQL or Python.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@griffin.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org