If you load the data once and perform multiple operations on it, Pig
should perform better because of its multiquery optimization. If the data
is very small there may be no noticeable difference, and you can go with
whichever is easier for you to code. I would suggest benchmarking both Pig
and Hive and determining for yourself which works better for your use case.
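
To illustrate why loading once and running several operations over the same
data can pay off, here is a minimal sketch in plain Python. It is not Pig or
Hive code, and the function name profile_column and the sample values are
made up for illustration. It computes several of the checks mentioned in
the question in a single pass over one column, where the naive alternative
would scan the data once per check.

```python
def profile_column(values):
    """Compute several data-quality metrics in one pass over a column.

    Hypothetical helper for illustration only; None stands in for NULL.
    """
    total = 0        # all rows, including nulls
    nulls = 0        # rows where the value is None
    max_len = 0      # longest non-null value, measured as a string
    seen = set()     # distinct non-null values

    for v in values:
        total += 1
        if v is None:
            nulls += 1
            continue
        seen.add(v)
        max_len = max(max_len, len(str(v)))

    non_null = total - nulls
    distinct = len(seen)
    return {
        "count": total,
        "null_count": nulls,
        "distinct_count": distinct,
        # duplicates among non-null values: non-null count - distinct count
        "duplicate_count": non_null - distinct,
        "max_length": max_len,
    }


if __name__ == "__main__":
    sample = ["a", "bb", None, "a", "ccc"]
    print(profile_column(sample))
```

Whether Pig or Hive actually shares the scan this way for your scripts and
queries is something the benchmark should confirm.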

Regards,
Rohini

On Thu, Jul 6, 2017 at 3:31 AM, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi All,
>
> Please help me with the below.
>
> *Use Case :*
> I am trying to develop a framework for data profiling and data quality.
> The data is stored in a HIVE table in RC format.
> No joins; I am only considering DQ checks that can be done on a single table.
>
> *Need suggestion :*
> I am thinking of using either PIG or HIVE for performing data quality and
> profiling. I need your suggestion on this. I have listed a few high-level
> points that came to my mind.
>
> *Performance:*
> - Will HIVE or PIG perform better? In PIG, I can load the data set into a
> variable and perform many operations on that data set. Will that
> improve performance?
> - In HIVE, I can put almost 70% of the checks in the same query, such as
> null, count, distinct count, duplicate count (total count - distinct
> count), length, etc. Even in this case, will PIG or HIVE perform better?
>
> *Coding:*
> - Though HIVE is easier to code than PIG, which one is most suitable for
> performing data quality and profiling?
>
> *Open source tools:*
> - Please suggest any open source tools built on Java or some other
> technologies which can be integrated with Hadoop without any installation.
>
>
>
> regards,
> Rams
>
