Hi All,

Please help me with the following.

*Use case:*
I am trying to develop a framework for data profiling and data quality (DQ) checks.
The data is stored in Hive tables in RC (RCFile) format.
No joins are involved; I am only considering DQ checks that can be done on a single table.

*Suggestions needed:*
I am considering either Pig or Hive for performing the data quality and
profiling checks, and would like your suggestions. I have listed a few
high-level points that came to mind.

*Performance:*
- Will Hive or Pig perform better? In Pig, I can load the data set into a
relation once and then perform many operations on it. Will that improve
performance?
- In Hive, I can put almost 70% of the checks in a single query: null count,
total count, distinct count, duplicate count (total count - distinct count),
length, etc. Even in this case, will Pig or Hive perform better?
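
To make the single-query idea concrete, here is a minimal sketch of the kind of combined check I have in mind. It is only an illustration under stated assumptions: the table `customers` and column `name` are hypothetical, and SQLite (via Python's `sqlite3`) stands in for Hive purely so the query can be run locally. The aggregates used (`COUNT`, `COUNT(DISTINCT ...)`, `SUM(CASE ...)`, `MIN`/`MAX` of `LENGTH`) also exist in HiveQL, but I have not verified this exact query on Hive. Note that the duplicate count here is computed over non-null values only, which differs slightly from total count minus distinct count when the column contains nulls.

```python
import sqlite3

# Hypothetical table/column names, used only for illustration.
# All checks are computed in one scan of the table.
PROFILE_SQL = """
SELECT
    COUNT(*)                                      AS total_count,
    SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) AS null_count,
    COUNT(DISTINCT name)                          AS distinct_count,
    COUNT(name) - COUNT(DISTINCT name)            AS duplicate_count,
    MIN(LENGTH(name))                             AS min_length,
    MAX(LENGTH(name))                             AS max_length
FROM customers
"""


def profile(values):
    """Load values into an in-memory table and run the combined profiling query."""
    conn = sqlite3.connect(":memory:")  # SQLite is a stand-in for Hive here
    try:
        conn.execute("CREATE TABLE customers (name TEXT)")
        conn.executemany(
            "INSERT INTO customers (name) VALUES (?)",
            [(v,) for v in values],
        )
        cur = conn.execute(PROFILE_SQL)
        columns = [d[0] for d in cur.description]
        return dict(zip(columns, cur.fetchone()))
    finally:
        conn.close()


if __name__ == "__main__":
    print(profile(["ann", "bob", "ann", None, "christine"]))
```

The point of the sketch is that every metric comes from one pass over the table, so the open question is whether Pig can do better than this when the remaining checks need additional passes.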

*Coding:*
- Though Hive is easier to code in than Pig, which one is more suitable for
performing data quality and profiling checks?

*Open source tools:*
- Please suggest any open source tools, built on Java or other
technologies, that can be integrated with Hadoop without any installation.



regards,
Rams
