Re: difference between Tajo, Hive and Impala

Tejas Patil Mon, 27 May 2013 15:48:10 -0700

Please correct me if I am wrong.

Hive: converts a query into MapReduce job(s). Can work on large-scale data
irrespective of the size of the result set.
Impala: runs daemons across all data nodes to get results. No MapReduce
job is launched. Good for queries with a small result set.
Tajo: converts a query into MapReduce 2 job(s). Smarter in terms of the
query plans generated and the physical operators selected, both based on
cluster characteristics.



On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:

> I'm sorry to send this mail again.
> I cannot understand why the lower part of the above mail is regarded as a
> signature.
> =====================================================
>
> Hi, Tejas
>
> The key difference between Tajo and Impala is the design goal. To increase
> the performance of query processing, Impala adopts an approach in which
> main memory is utilized as much as possible and intermediate data are
> transferred via streaming. If a query requires too much memory, Impala
> cannot process the query. Thus, Impala says that it is not an alternative
> to Hive.
>
> However, Tajo uses query optimization that considers the user query, the
> characteristics of the data, the status of the cluster, and so on. Thus,
> Tajo can process a query with Impala's algorithm, Hive's algorithm, or any
> other algorithm. For example, Tajo can process a join query using the
> repartition join or the merge join. Intermediate results can be
> materialized to disk or kept in memory. Since Tajo builds a query plan
> considering the various factors mentioned above, it can always process
> user queries. So, we can say that Tajo can be an alternative to Hive.
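
As a rough illustration of the idea in the paragraph above (choose the join algorithm and where to keep intermediate results from what is known about the data and the cluster), here is a minimal Python sketch. It is not Tajo's optimizer: the `JoinStats` fields, the `choose_join_plan` function, and the two rules inside it are all hypothetical and only show why a fallback to disk means some plan always exists.

```python
from dataclasses import dataclass


@dataclass
class JoinStats:
    """Hypothetical inputs an optimizer might look at (names are assumptions)."""
    left_bytes: int
    right_bytes: int
    inputs_sorted_on_key: bool
    memory_per_node_bytes: int


def choose_join_plan(stats: JoinStats) -> tuple[str, str]:
    """Return (join algorithm, intermediate-data strategy) for a join.

    Illustrative rules only:
    - inputs already sorted on the join key -> merge join,
      otherwise repartition both sides on the key;
    - keep intermediates in memory only if the smaller side fits,
      otherwise materialize to disk so the query can still run.
    """
    algorithm = "merge join" if stats.inputs_sorted_on_key else "repartition join"
    smaller_side = min(stats.left_bytes, stats.right_bytes)
    if smaller_side <= stats.memory_per_node_bytes:
        intermediate = "in memory"
    else:
        intermediate = "materialized to disk"
    return algorithm, intermediate


if __name__ == "__main__":
    small = JoinStats(10_000, 50_000, True, 1_000_000)
    large = JoinStats(10**12, 10**12, False, 10**9)
    print(choose_join_plan(small))
    print(choose_join_plan(large))
```

The point of the second rule is that running out of memory changes the plan instead of failing the query, which is the contrast with a purely memory-bound engine drawn above.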
>
> Tajo can outperform Hive for most queries. The key reason is that Tajo
> uses its own query engine while Hive uses MapReduce. This restricts Hive
> to MapReduce-based algorithms, whereas Tajo can use a more optimized
> algorithm.
>
> A sort query is a good example. Hive supports only hash partitioning.
> Thus, each node sorts data locally in the map phase and *ONE NODE* must
> perform the global sort in the reduce phase.
> However, Tajo supports a sort algorithm using range partitioning. In
> the first phase, each node sorts data locally as in Hive, but the
> intermediate data are partitioned by the range of the sort key. In the
> second phase, each node performs a local sort to get the final results.
> Since the intermediate data are partitioned by the range of the sort key,
> concatenating the per-node results in range order gives a correctly
> sorted final result.
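
To make the two-phase sort described above concrete, here is a minimal single-process Python sketch. It is neither Tajo's nor Hive's code: the function names are made up, each "node" is just a list, and the range boundaries are taken from the full data set where a real system would presumably sample. It contrasts range partitioning, where per-partition sorted outputs can simply be concatenated, with hash partitioning, where they cannot.

```python
import bisect


def pick_boundaries(all_keys, num_partitions):
    """Pick num_partitions - 1 split points so partitions cover key ranges."""
    ordered = sorted(all_keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


def range_partitioned_sort(node_inputs, num_partitions):
    """Two-phase sort: local sort + range partition, then local sort again."""
    all_keys = [key for chunk in node_inputs for key in chunk]
    boundaries = pick_boundaries(all_keys, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    # Phase 1: every node sorts its own chunk and routes keys by range.
    for chunk in node_inputs:
        for key in sorted(chunk):
            partitions[bisect.bisect_right(boundaries, key)].append(key)
    # Phase 2: every partition is sorted independently (in parallel on a
    # real cluster); concatenating in partition order is globally sorted.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result


def hash_partitioned_concat(node_inputs, num_partitions):
    """Same two phases, but keys are routed by hash instead of by range."""
    partitions = [[] for _ in range(num_partitions)]
    for chunk in node_inputs:
        for key in sorted(chunk):
            partitions[hash(key) % num_partitions].append(key)
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result


if __name__ == "__main__":
    data = [[9, 3, 7], [1, 8, 2], [6, 0, 5, 4]]
    print(range_partitioned_sort(data, 3))
    print(hash_partitioned_concat(data, 2))
```

With hash routing each partition is sorted but the partitions interleave, so a total order still needs a final merge or a single reducer. With range routing every key in partition i is no larger than any key in partition i+1, so no single node has to see all the data.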
>
> If you have any questions about this,
> please feel free to ask.
>
> Thanks,
> Jihoon
>
>
>
> > 2013/5/26 Tejas Patil <[email protected]>
> >
> >> Hi @dev,
> >>
> >> Can anyone comment on the difference between Tajo, Hive and Impala?
> >> Also, why does Tajo perform better than Hive? In what scenarios would
> >> it be good to use Tajo, and when would it be bad?
> >>
> >> Thanks,
> >> Tejas Patil
> >> http://www.linkedin.com/in/tejaspatil1
> >>
> >
> >
> >
> > --
> > Jihoon Son
> >
> > Database & Information Systems Group,
> > Prof. Yon Dohn Chung Lab.
> > Dept. of Computer Science & Engineering,
> > Korea University
> > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > Seoul, 136-713, Republic of Korea
> >
> > Tel : +82-2-3290-3580
> > E-mail : [email protected]
> >
