Hi, I work on HBO's Data Engineering team. We are evaluating multiple tools as part of implementing our Data Quality framework. I came across Griffin and it looks very promising, but I have a couple of questions; it would be great if you could clarify them. Our use cases are mostly batch for now.
1. https://cwiki.apache.org/confluence/display/GRIFFIN/7.+Apache+Griffin+DSL+Guidance states that, right now, the Griffin DSL supports only Hive and Avro as data sources, and Hive, JSON, and Avro as data formats. We have other data sources and formats as well. From the documentation, my understanding is that if the Griffin DSL does not support a source, I can fall back to spark-sql rules. Is that correct? For example, using spark-sql, could I write a similar configuration for a Parquet file residing in S3 and get the metrics?

2. Does Griffin persist the monitored metrics in Elasticsearch? If so, can I configure it to use an Elasticsearch cluster running outside the Griffin Docker container? Can you point me to the documentation for that?

3. On a similar note, we are a complete AWS shop. Do any of the active users run Griffin on AWS, and is there any documentation available? Since Griffin submits the Spark job via Livy, I think it should work even if we use EMR, right?

4. How can I implement Completeness, Consistency, and Validity measures? Are they future roadmap items, and if so, do you have GA dates? We would also be happy to contribute if it adds value to our requirements.

Thanks,
Nidhin
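P.S. To make question 1 concrete, here is a sketch of the kind of measure config I have in mind for a Parquet file in S3 with a spark-sql rule. The connector `type`/`version` values, field names, bucket path, and the rule itself are all my guesses from the DSL guidance and examples, not something I have run, so please correct anything that is off:

```json
{
  "name": "parquet_s3_null_check",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "connectors": [
        {
          "type": "file",
          "version": "1.7",
          "config": {
            "format": "parquet",
            "paths": ["s3a://my-bucket/path/to/data/"]
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "spark-sql",
        "name": "null_id_count",
        "rule": "SELECT COUNT(*) AS null_ids FROM src WHERE id IS NULL"
      }
    ]
  }
}
```

Is this roughly the right shape, or does the spark-sql path require a different connector definition?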

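P.S. Regarding question 2: if Griffin does store metrics in Elasticsearch, I assume pointing the service at an external cluster would look something like the following in the service's application.properties. The property names below are from my reading of the deployment guide, so apologies if I have them wrong:

```properties
# Assumed property names for an external Elasticsearch cluster;
# host/port values here are placeholders for our own endpoint.
elasticsearch.host = es.internal.example.com
elasticsearch.port = 9200
elasticsearch.scheme = http
```

If there is a different supported way to externalize the metrics store, a pointer to that documentation would be much appreciated.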