Re: difference between Tajo, Hive and Impala

Hyunsik Choi Tue, 28 May 2013 19:27:03 -0700

Until 11 Jun, I also cannot spend much time on this because of my Ph.D. oral defense =)

Here are a few things that may interest you.


Refactoring of the YARN-related parts and the DAG framework
http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCAM=XDd9DtUJ-JKXuMyVA4SMzWiXXNRZg29G=vx73nohxkyo...@mail.gmail.com%3E

Roadmap of Tajo
http://wiki.apache.org/tajo/Roadmap

Cost-based Optimizer for Tajo
https://issues.apache.org/jira/browse/TAJO-24

Best regards,
Hyunsik


On Tue, May 28, 2013 at 3:25 PM, Tejas Patil <[email protected]> wrote:

> Hi Hyunsik,
>
> Tajo is a very interesting system to me :) Working on an incubation project
> is awesome. I had started peeking into the codebase, but I really didn't get
> enough time to continue with that. My quarter will end in 2-3 weeks. Do you
> have any suggestions for a few JIRAs that I could play around with?
>
> Thanks,
> Tejas
>
>
> On Mon, May 27, 2013 at 11:02 PM, Hyunsik Choi <[email protected]> wrote:
>
> > Tejas,
> >
> > If so, Tajo is a very interesting system for you. I already know Asterix,
> > and it impressed me very much. AsterixDB also looks very interesting.
> > I'll read the paper. Your ideas that were adopted in AsterixDB can
> > probably be adopted in Tajo as well.
> >
> > I attach links to the Tajo paper [1] and poster [2]. I hope you find them
> > interesting.
> >
> > [1] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_ICDE_2013.pdf
> > [2] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_Poster_ICDE_2013.png
> >
> > Thanks,
> > Hyunsik
> >
> >
> >
> > On Tue, May 28, 2013 at 2:44 PM, Tejas Patil <[email protected]> wrote:
> >
> > > Thanks Hyunsik and Owen.
> > >
> > > The DAG-based approach to representing query plans is closely aligned
> > > with the system I have been working on as part of my current studies at
> > > UC Irvine with Prof. Mike Carey: AsterixDB [0]
> > >
> > > [0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf
> > >
> > >
> > > On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <[email protected]> wrote:
> > >
> > > > On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <[email protected]> wrote:
> > > >
> > > > > Please correct me if I am wrong.
> > > > >
> > > > > Hive: converts a query into MapReduce job(s). Can work on large-scale
> > > > > data irrespective of the size of the result set.
> > > > >
> > > >
> > > > Hive will continue to support MapReduce, but it will also get support
> > > > for Tez. Tez is an Apache project that is building an execution engine
> > > > that runs under YARN. By running under Tez instead of MapReduce, Hive
> > > > will gain:
> > > >   * One job instead of many, so resources are not released before the
> > > >     query is done
> > > >   * No hard synchronization barrier between jobs
> > > >   * Shuffling from memory instead of from hard disk
> > > >
> > > >
> > > > > Impala: runs daemons across all data nodes to get results. No
> > > > > MapReduce job is launched. Good for queries with small result sets.
> > > > > Tajo: converts a query into MapReduce 2 job(s). Smarter in terms of
> > > > > the query plans generated and the physical operator selection, both
> > > > > based on cluster characteristics.
> > > > >
> > > > >
> > > > > On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:
> > > > >
> > > > > > I'm sorry for sending this mail again.
> > > > > > I don't understand why the lower part of my previous mail was
> > > > > > treated as a signature.
> > > > > > =====================================================
> > > > > >
> > > > > > Hi, Tejas
> > > > > >
> > > > > > The key difference between Tajo and Impala is the design goal. To
> > > > > > increase query processing performance, Impala adopts an approach in
> > > > > > which main memory is utilized as much as possible and intermediate
> > > > > > data are transferred via streaming. If a query requires too much
> > > > > > memory, Impala cannot process it. Thus, Impala says that it is not
> > > > > > an alternative to Hive.
> > > > > >
> > > > > > However, Tajo uses query optimization that considers user queries,
> > > > > > the characteristics of the data, the status of the cluster, and so
> > > > > > on. Thus, Tajo can process a query with Impala's algorithms, Hive's
> > > > > > algorithms, or any other algorithms. For example, Tajo can process a
> > > > > > join query using a repartition join or a merge join. Intermediate
> > > > > > results can be materialized to disk or kept in memory. Since Tajo
> > > > > > builds a query plan considering the various factors mentioned above,
> > > > > > it can always process user queries. So, we can say that Tajo can be
> > > > > > an alternative to Hive.
> > > > > >
> > > > > > Tajo can outperform Hive for most queries. The key reason is that
> > > > > > Tajo uses its own query engine while Hive uses MapReduce. This
> > > > > > limits Hive to MapReduce-based algorithms, whereas Tajo can use
> > > > > > more optimized algorithms.
> > > > > >
> > > > > > A sort query is a good example. Hive supports only hash
> > > > > > partitioning. Thus, each node sorts data locally in the map phase
> > > > > > and *ONE NODE* must perform the global sort in the reduce phase.
> > > > > > However, Tajo supports a sort algorithm that uses range
> > > > > > partitioning. In the first phase, each node sorts data locally as in
> > > > > > Hive, but the intermediate data are partitioned by the range of the
> > > > > > sort key. In the second phase, each node performs a local sort to
> > > > > > get the final results. Since the intermediate data are partitioned
> > > > > > by the range of the sort key, the final results are correct.
> > > > > >
> > > > > > If you have any questions about this,
> > > > > > please feel free to ask.
> > > > > >
> > > > > > Thanks,
> > > > > > Jihoon
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 2013/5/26 Tejas Patil <[email protected]>
> > > > > > >
> > > > > > >> Hi @dev,
> > > > > > >>
> > > > > > > >> Can anyone comment on the differences between Tajo, Hive, and
> > > > > > > >> Impala? Also, why does Tajo perform better than Hive? In which
> > > > > > > >> scenarios would it be good to use Tajo, and when would it be bad?
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Tejas Patil
> > > > > > >> http://www.linkedin.com/in/tejaspatil1
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jihoon Son
> > > > > >
> > > > > > Database & Information Systems Group,
> > > > > > Prof. Yon Dohn Chung Lab.
> > > > > > Dept. of Computer Science & Engineering,
> > > > > > Korea University
> > > > > > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > > > > > Seoul, 136-713, Republic of Korea
> > > > > >
> > > > > > Tel : +82-2-3290-3580
> > > > > > E-mail : [email protected]
> > > > > >
> > > > >
> > > >
> > >
> >
>
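
The range-partitioned sort that Jihoon describes in the quoted message can be sketched roughly as follows. This is only a single-process illustration in Python, not Tajo's code: the function names, the sampling step used to pick range boundaries, and the use of plain lists to stand in for nodes are all assumptions made for the example.

```python
import bisect


def pick_boundaries(sample, num_partitions):
    """Pick up to num_partitions - 1 split points from a sample of sort keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        parts[bisect.bisect_right(boundaries, key)].append(key)
    return parts


def distributed_sort(node_inputs, num_partitions):
    """Simulate a two-phase sort; each inner list plays the role of one node."""
    # Phase 1: every node sorts its own data locally.
    local_runs = [sorted(keys) for keys in node_inputs]

    # Hypothetical sampling step: take roughly ten keys per node to choose ranges.
    sample = [k for run in local_runs for k in run[:: max(1, len(run) // 10)]]
    boundaries = pick_boundaries(sample, num_partitions)

    # Shuffle: intermediate data are partitioned by the range of the sort key.
    shuffled = [[] for _ in range(num_partitions)]
    for run in local_runs:
        for i, part in enumerate(range_partition(run, boundaries)):
            shuffled[i].extend(part)

    # Phase 2: every node sorts only the range it received.
    sorted_parts = [sorted(part) for part in shuffled]

    # Every key in partition i is <= every key in partition i + 1, so reading
    # the partitions in order yields the globally sorted result.
    return [k for part in sorted_parts for k in part]
```

For example, `distributed_sort([[5, 3, 9], [1, 8, 2], [7, 4, 6]], 3)` returns the keys 1 through 9 in order, and no single node ever has to sort the whole data set.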

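The repartition join mentioned in the quoted message as one of the join strategies can be illustrated in the same spirit. This is a generic sketch of the technique, not how Tajo implements it: rows are plain tuples, the partitions are lists standing in for nodes, and all function and parameter names are hypothetical.

```python
from collections import defaultdict


def repartition(rows, key_index, num_partitions):
    """Send each row to a partition chosen by hashing its join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts


def local_hash_join(left, right, left_key, right_key):
    """Join two co-located partitions with an in-memory hash table."""
    table = defaultdict(list)
    for row in left:
        table[row[left_key]].append(row)
    return [l + r for r in right for l in table.get(r[right_key], [])]


def repartition_join(left, right, left_key, right_key, num_partitions):
    """Repartition both inputs on the join key, then join partition by partition."""
    left_parts = repartition(left, left_key, num_partitions)
    right_parts = repartition(right, right_key, num_partitions)
    result = []
    # Rows with equal join keys land in the same partition number on both
    # sides, so each pair of partitions can be joined independently.
    for lp, rp in zip(left_parts, right_parts):
        result.extend(local_hash_join(lp, rp, left_key, right_key))
    return result
```

A merge join would replace `local_hash_join` with a merge over inputs sorted on the join key; which strategy is cheaper depends on the kind of factors the message lists, such as data characteristics and cluster status.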