Hi all, please help me with the questions below.
*Use case:* I am trying to develop a framework for data profiling and data quality (DQ) checks. The data is stored in Hive tables in RC format. No joins are involved; I am only considering DQ checks that can be done on a single table.

*Need suggestion:* I am thinking of using either Pig or Hive for the data quality and profiling checks, and would like your suggestions. I have listed a few high-level points that came to mind.

*Performance:*
- Will Hive or Pig perform better? In Pig, I can load the data set into a variable (relation) and perform many operations on it. Will that improve performance?
- In Hive, I can put almost 70% of the checks in the same query, for example null count, row count, distinct count, duplicate count (total count - distinct count), length, etc. Even in this case, will Pig or Hive perform better?

*Coding:*
- Though Hive is easier to code in than Pig, which one is more suitable for performing data quality and profiling?

*Open source tools:*
- Please suggest any open source tools, built on Java or other technologies, that can be integrated with Hadoop without any installation.

Regards,
Rams
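
P.S. To make the "most checks in the same query" idea concrete, here is a rough sketch of what I have in mind. It is only an illustration, not a tested framework: a small Python helper that builds one HiveQL statement covering the null, count, distinct count, duplicate count and length checks for each column, so the table is scanned by a single query. The function name `build_profile_query`, the table name `customer` and the columns `cust_id` and `email` are made up for the example, and I am assuming Hive accepts several COUNT(DISTINCT ...) expressions in one SELECT.

```python
def build_profile_query(table, columns):
    """Build one HiveQL SELECT that profiles every listed column.

    Hypothetical helper: table and column names are supplied by the caller
    and are not validated or escaped here.
    """
    # Total row count is computed once and shared by all columns.
    select_items = ["COUNT(*) AS total_count"]
    for col in columns:
        select_items.extend([
            # Null check: count rows where the column is NULL.
            f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_null_count",
            # Distinct count (COUNT(DISTINCT ...) ignores NULLs).
            f"COUNT(DISTINCT {col}) AS {col}_distinct_count",
            # Duplicate count as described above: total count - distinct count.
            f"COUNT(*) - COUNT(DISTINCT {col}) AS {col}_duplicate_count",
            # Length check: longest value when rendered as a string.
            f"MAX(LENGTH(CAST({col} AS STRING))) AS {col}_max_length",
        ])
    return "SELECT\n  " + ",\n  ".join(select_items) + f"\nFROM {table}"


if __name__ == "__main__":
    # Example table and columns are assumptions for illustration only.
    print(build_profile_query("customer", ["cust_id", "email"]))
```

The generated statement could then be submitted through the Hive CLI or any Hive client. The equivalent in Pig would presumably be a LOAD followed by GROUP ... ALL and a FOREACH with the same aggregates, which is the comparison I am asking about.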