Hi,

I work in HBO’s Data Engineering team. We are evaluating multiple tools as part 
of implementing our Data Quality framework. I came across Griffin and it looks 
very promising. I have a couple of questions; it would be great if you could 
clarify them. Our use cases are mostly batch for now.


  1.  
https://cwiki.apache.org/confluence/display/GRIFFIN/7.+Apache+Griffin+DSL+Guidance
 states that right now Griffin DSL supports only Hive and Avro as data sources, 
and Hive, JSON and Avro as data formats. We have other data sources/formats as 
well. From the documentation, my understanding is that if Griffin DSL does not 
support a source, I can fall back to Spark SQL. Is that correct? If so, can I 
use Spark SQL to do a similar kind of configuration for a Parquet file residing 
in S3 and get the metrics?
  2.  Does Griffin persist the monitored metrics in Elasticsearch? If so, can I 
configure it to use an Elasticsearch cluster outside the Griffin Docker 
container? Can you point me to the documentation for that?
  3.  On a similar note, we are entirely an AWS shop. Do any of the active 
users run Griffin on AWS? Is there any documentation available? Since Griffin 
submits the Spark job via Livy, I think it should be okay even if we use EMR, 
right?
  4.  How can I implement Completeness, Consistency and Validity measures? Are 
they future roadmap items? If so, do you have any GA dates?
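To make question 1 (and the Completeness part of question 4) concrete, here is 
roughly the kind of batch measure configuration I have in mind, adapted from 
the sample DQ configs I found. The connector type, field names, bucket path and 
column name are all my guesses, so please correct me if the actual schema for a 
Parquet-on-S3 source with a spark-sql rule looks different:

```json
{
  "name": "completeness_batch",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "connectors": [
        {
          "type": "file",
          "config": {
            "format": "parquet",
            "file.path": "s3a://our-bucket/our-table/"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "spark-sql",
        "name": "completeness",
        "rule": "SELECT COUNT(*) AS total, SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS missing FROM source"
      }
    ]
  }
}
```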
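And for question 3, this is how I imagine we would submit the measure job to 
Livy's REST API on an EMR master node. The Livy endpoint, jar location, main 
class and arguments below are placeholders for whatever the Griffin service 
would normally pass; this just builds the request rather than sending it:

```python
import json
from urllib import request

# Batch submission payload for Livy's POST /batches endpoint.
# Jar path, class name and args are placeholders for what the
# Griffin service would normally supply.
payload = {
    "file": "s3://our-bucket/jars/measure.jar",
    "className": "org.apache.griffin.measure.Application",
    "args": ["env.json", "dq.json"],
}

def submit_batch(livy_url: str, body: dict) -> request.Request:
    """Build the POST /batches request we would send to Livy."""
    return request.Request(
        f"{livy_url}/batches",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# On a real cluster this would be submitted with e.g.:
# resp = request.urlopen(submit_batch("http://emr-master:8998", payload))
```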

Also, we are happy to contribute if it adds value to our requirements.

Thanks,
Nidhin
