Re: difference between Tajo, Hive and Impala

Hyunsik Choi Mon, 27 May 2013 23:03:12 -0700

Tejas,

If so, Tajo should be a very interesting system for you. I already know
Asterix, and I found it very impressive. AsterixDB also looks very
interesting; I'll read the paper you linked. The ideas you applied in
AsterixDB could probably be applied to Tajo as well.


Here are links to the Tajo paper [1] and poster [2]. I hope you find them
interesting.

[1] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_ICDE_2013.pdf
[2] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_Poster_ICDE_2013.png

Thanks,
Hyunsik



On Tue, May 28, 2013 at 2:44 PM, Tejas Patil <[email protected]> wrote:

> Thanks Hyunsik and Owen.
>
> The DAG-based approach to representing query plans is closely aligned with
> the system I have been working on as part of my current studies at UC
> Irvine with Prof. Mike Carey: AsterixDB [0]
>
> [0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf
>
>
> On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <[email protected]>
> wrote:
>
> > On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <[email protected]> wrote:
> >
> > > Please correct me if I am wrong.
> > >
> > > Hive : converts query to Map Reduce job(s). Can work on large scale
> > > data irrespective of the size of result set.
> > >
> >
> > Hive will continue to support MapReduce, but it will also get support for
> > Tez. Tez is an Apache project that is building an execution engine that
> > runs under Yarn. By running under Tez, instead of MapReduce, Hive will
> > gain:
> >   * Use one job instead of many and thus not let go of resources before
> >     the query is done
> >   * Remove the hard synchronization barrier between jobs
> >   * Allow Hive to shuffle from memory instead of hard disk
> >
> >
> > > Impala : runs daemons across all data nodes to get results. No
> > > map-reduce job is launched. Good for queries with small result set.
> > > Tajo : converts query to Map Reduce 2 job(s). Smarter in terms of query
> > > plans generated and physical operator selection both based on cluster
> > > characteristics.
> > >
> > >
> > > On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:
> > >
> > > > I'm sorry to send this mail again.
> > > > I cannot understand why the lower part of the above mail is regarded
> > > > as a signature.
> > > > =====================================================
> > > >
> > > > Hi, Tejas
> > > >
> > > > The key difference between Tajo and Impala is the design goal. To
> > > > increase query processing performance, Impala adopts an approach in
> > > > which main memory is utilized as much as possible and intermediate
> > > > data are transferred via streaming. If a query requires too much
> > > > memory, Impala cannot process it. Thus, the Impala project says that
> > > > it is not a replacement for Hive.
> > > >
> > > > In contrast, Tajo's query optimizer considers the user query, the
> > > > characteristics of the data, the status of the cluster, and so on.
> > > > Thus, Tajo can process a query with Impala's algorithm, Hive's
> > > > algorithm, or any other algorithm. For example, Tajo can process a
> > > > join query using either the repartition join or the merge join.
> > > > Intermediate results can be materialized to disk or kept in memory.
> > > > Since Tajo builds a query plan considering the factors mentioned
> > > > above, it can always process user queries. So, we can say that Tajo
> > > > can be an alternative to Hive.
> > > >
> > > > Tajo can outperform Hive for most queries. The key reason is that
> > > > Tajo uses its own query engine while Hive uses MapReduce. This
> > > > restricts Hive to MapReduce-based algorithms, whereas Tajo can use a
> > > > more optimized algorithm.
> > > >
> > > > A sort query is a good example. Hive supports only hash partitioning.
> > > > Thus, each node sorts data locally in the map phase and *ONE NODE*
> > > > must perform the global sort in the reduce phase.
> > > > Tajo, however, supports a sort algorithm that uses range
> > > > partitioning. In the first phase, each node sorts data locally as in
> > > > Hive, but the intermediate data are partitioned by ranges of the sort
> > > > key. In the second phase, each node performs a local sort to get the
> > > > final results. Since the intermediate data are partitioned by ranges
> > > > of the sort key, every key in one partition precedes every key in the
> > > > next, so the final results are correct.
> > > >
> > > > If you have any questions about this,
> > > > please feel free to ask.
> > > >
> > > > Thanks,
> > > > Jihoon
> > > >
> > > >
> > > >
> > > > 2013/5/26 Jihoon Son <[email protected]>
> > > >
> > > > > Hi, Tejas
> > > > >
> > > > > The key differences between Tajo and Impala is the design goal. To
> > > > > increase the performance of query processing, Impala adopts an
> > approach
> > > > > which the main memory is utilized as much as possible and
> > intermediate
> > > > data
> > > > > are transfered via streaming. If a query requires too much memory,
> > > Impala
> > > > > cannot process the query. Thus, Impala says that it is not an
> > alternate
> > > > of
> > > > > Hive.
> > > > >
> > > > > However, Tajo uses a query optimization which considers user
> queries,
> > > > > characteristics of data, the status of cluster, and so on. Thus,
> Tajo
> > > can
> > > > > process a query with Impala's algorithm, Hive's algorithm or any
> > other
> > > > > algorithms. For an example, Tajo can process a join query using the
> > > > > repartition join, or the merge join. Intermediate results can be
> > > > > materialized to disks or maintained in memory. Since Tajo builds a
> > > query
> > > > > plan considering above mentioned various factors, it can always
> > process
> > > > > user queries. So, we can say that Tajo can be an alternate of Hive.
> > > > >
> > > > > Tajo can perform well over Hive for most of queries. The key reason
> > is
> > > > > that Tajo uses the own query engine while Hive uses MapReduce. This
> > > > limits
> > > > > that Hive can uses only MapReduce-based algorithms. However, Tajo
> can
> > > > uses
> > > > > a more optimized algorithm.
> > > > >
> > > > > A sort query is a good example. Hive supports only the hash
> > > partitioning.
> > > > > Thus, each node sort data locally in the map phase and*ONE NODE*
> > should
> > > > > perform global sort in the reduce phase.
> > > > > However, Tajo supports a sort algorithm using the range
> partitioning.
> > > In
> > > > > the first phase, each node sort data locally as in Hive, but the
> > > > > intermediate data are partitioned by the range of the sort key. In
> > the
> > > > > second phase, each node performs local sort to get the final
> results.
> > > > Since
> > > > > intermediate data are partitioned by the range of sort key, final
> > > results
> > > > > are correct.
> > > > >
> > > > > If you have any questions about this,
> > > > > please feel free to ask.
> > > > >
> > > > > Thanks,
> > > > > Jihoon
> > > > >
> > > > >
> > > > > 2013/5/26 Tejas Patil <[email protected]>
> > > > >
> > > > >> Hi @dev,
> > > > >>
> > > > > >> Can anyone comment about the difference between Tajo, Hive and
> > > > > >> Impala? Also, what is the reason for Tajo to perform well over
> > > > > >> Hive? In what scenario would it be good to use Tajo, and when
> > > > > >> would it be bad?
> > > > >>
> > > > >> Thanks,
> > > > >> Tejas Patil
> > > > >> http://www.linkedin.com/in/tejaspatil1
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jihoon Son
> > > > >
> > > > > Database & Information Systems Group,
> > > > > Prof. Yon Dohn Chung Lab.
> > > > > Dept. of Computer Science & Engineering,
> > > > > Korea University
> > > > > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > > > > Seoul, 136-713, Republic of Korea
> > > > >
> > > > > Tel : +82-2-3290-3580
> > > > > E-mail : [email protected]
> > > > >
> > > >
> > > >
> > >
> >
>
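
To make the range-partitioned sort described in Jihoon's mail above a bit
more concrete, here is a minimal single-process sketch in Python. It is only
an illustration under my own assumptions, not Tajo's actual code: the function
names are hypothetical, each "node" is just a Python list, and the range
boundaries are picked from all keys rather than from a sample as a real
system would likely do.

```python
import bisect


def pick_boundaries(keys, num_partitions):
    """Pick num_partitions - 1 split points that divide the keys evenly."""
    ordered = sorted(keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partitioned_sort(node_inputs, num_partitions):
    """Sort keys spread over several 'nodes' using range partitioning.

    Returns one sorted list per partition. Partition i only holds keys that
    are <= every key in partition i + 1, so reading the partitions in order
    yields the globally sorted result without a single-node merge.
    """
    all_keys = [key for rows in node_inputs for key in rows]
    boundaries = pick_boundaries(all_keys, num_partitions)

    # Phase 1: each node sorts locally, then routes every key to the
    # partition whose key range covers it.
    buckets = [[] for _ in range(num_partitions)]
    for rows in node_inputs:
        for key in sorted(rows):
            buckets[bisect.bisect_right(boundaries, key)].append(key)

    # Phase 2: each receiving node sorts only its own range.
    return [sorted(bucket) for bucket in buckets]
```

For example, range_partitioned_sort([[9, 1, 5], [7, 3, 8], [2, 6, 4]], 3)
returns [[1, 2, 3], [4, 5, 6], [7, 8, 9]]. With hash partitioning the keys
in each partition would be interleaved across the whole key space, which is
why a total order then needs a final pass on one node.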
