Re: difference between Tajo, Hive and Impala

Tejas Patil Mon, 27 May 2013 23:26:13 -0700

Hi Hyunsik,

Tajo is a very interesting system to me :) Working on an incubation project
is awesome. I had started peeking into the codebase, but I really didn't get
enough time to continue with that. My quarter will end in 2-3 weeks. Do you
have any suggestions for a few JIRA issues that I could play around with?


Thanks,
Tejas


On Mon, May 27, 2013 at 11:02 PM, Hyunsik Choi <[email protected]> wrote:

> Tejas,
>
> If so, Tajo is a very interesting system for you. I already knew of
> Asterix, and I found it very impressive. AsterixDB also looks very
> interesting. I'll read it. Your ideas that were adopted in AsterixDB can
> probably be adopted in Tajo as well.
>
> I am attaching links to the Tajo paper [1] and poster [2]. I hope you find
> them interesting.
>
> [1] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_ICDE_2013.pdf
> [2] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_Poster_ICDE_2013.png
>
> Thanks,
> Hyunsik
>
>
>
> On Tue, May 28, 2013 at 2:44 PM, Tejas Patil <[email protected]> wrote:
>
> > Thanks Hyunsik and Owen.
> >
> > The DAG-based approach to representing query plans is quite aligned with
> > the system I have been working on as part of my current studies at UC
> > Irvine with Prof. Mike Carey: AsterixDB [0]
> >
> > [0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf
> >
> >
> > On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <[email protected]> wrote:
> >
> > > On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <[email protected]> wrote:
> > >
> > > > Please correct me if I am wrong.
> > > >
> > > > Hive : converts query to Map Reduce job(s). Can work on large scale
> > > > data irrespective of the size of result set.
> > > >
> > >
> > > Hive will continue to support MapReduce, but it will also get support
> > > for Tez. Tez is an Apache project that is building an execution engine
> > > that runs under YARN. By running under Tez instead of MapReduce, Hive
> > > will be able to:
> > >   * Use one job instead of many, and thus not let go of resources
> > >     before the query is done
> > >   * Remove the hard synchronization barrier between jobs
> > >   * Shuffle from memory instead of from hard disk
> > >
> > >
> > > > Impala : runs daemons across all data nodes to get results. No
> > > > map-reduce job is launched. Good for queries with small result set.
> > > > Tajo : converts query to Map Reduce 2 job(s). Smarter in terms of
> > > > query plans generated and physical operator selection, both based on
> > > > cluster characteristics.
> > > >
> > > >
> > > > On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:
> > > >
> > > > > I'm sorry to send this mail again.
> > > > > I cannot understand why the lower part of the above mail is
> > > > > regarded as a signature.
> > > > > =====================================================
> > > > >
> > > > > Hi, Tejas
> > > > >
> > > > > The key difference between Tajo and Impala is the design goal. To
> > > > > increase query processing performance, Impala adopts an approach
> > > > > in which main memory is utilized as much as possible and
> > > > > intermediate data are transferred via streaming. If a query
> > > > > requires too much memory, Impala cannot process the query. Thus,
> > > > > Impala says that it is not an alternative to Hive.
> > > > >
> > > > > However, Tajo uses query optimization that considers the user
> > > > > query, the characteristics of the data, the status of the cluster,
> > > > > and so on. Thus, Tajo can process a query with Impala's algorithm,
> > > > > Hive's algorithm, or any other algorithm. For example, Tajo can
> > > > > process a join query using a repartition join or a merge join.
> > > > > Intermediate results can be materialized to disk or kept in
> > > > > memory. Since Tajo builds a query plan considering the factors
> > > > > mentioned above, it can always process user queries. So, we can
> > > > > say that Tajo can be an alternative to Hive.
> > > > >
> > > > > Tajo can outperform Hive for most queries. The key reason is that
> > > > > Tajo uses its own query engine while Hive uses MapReduce. This
> > > > > limits Hive to MapReduce-based algorithms, whereas Tajo can use a
> > > > > more optimized algorithm.
> > > > >
> > > > > A sort query is a good example. Hive supports only hash
> > > > > partitioning. Thus, each node sorts data locally in the map phase
> > > > > and *ONE NODE* must perform the global sort in the reduce phase.
> > > > > However, Tajo supports a sort algorithm using range partitioning.
> > > > > In the first phase, each node sorts data locally as in Hive, but
> > > > > the intermediate data are partitioned by the range of the sort
> > > > > key. In the second phase, each node performs a local sort to get
> > > > > the final results. Since the intermediate data are partitioned by
> > > > > the range of the sort key, the locally sorted partitions, read in
> > > > > range order, form a correct globally sorted result.
> > > > >
> > > > > If you have any questions about this,
> > > > > please feel free to ask.
> > > > >
> > > > > Thanks,
> > > > > Jihoon
> > > > >
> > > > >
> > > > >
> > > > > > 2013/5/26 Tejas Patil <[email protected]>
> > > > > >
> > > > > >> Hi @dev,
> > > > > >>
> > > > > >> Can anyone comment about the difference between Tajo, Hive and
> > > > > >> Impala? Also, what is the reason for Tajo to perform well over
> > > > > >> Hive? In what scenario would it be good to use Tajo, and when
> > > > > >> would it be bad?
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Tejas Patil
> > > > > >> http://www.linkedin.com/in/tejaspatil1
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jihoon Son
> > > > > >
> > > > > > Database & Information Systems Group,
> > > > > > Prof. Yon Dohn Chung Lab.
> > > > > > Dept. of Computer Science & Engineering,
> > > > > > Korea University
> > > > > > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > > > > > Seoul, 136-713, Republic of Korea
> > > > > >
> > > > > > Tel : +82-2-3290-3580
> > > > > > E-mail : [email protected]
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
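
The range-partitioned sort that Jihoon describes in the quoted thread can be
sketched roughly as follows. This is only a minimal single-process
illustration, not Tajo's actual implementation: the function names
(choose_boundaries, range_partitioned_sort) are hypothetical, each "node" is
just a Python list, and the split points are taken from all keys rather than
from a sample, which a real system would more likely use.

```python
import bisect


def choose_boundaries(keys, num_partitions):
    """Pick num_partitions - 1 split points from the given keys.

    Assumption: a real engine would sample keys; here we use all of them.
    """
    ordered = sorted(keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partitioned_sort(node_inputs, num_partitions):
    """Return a list of sorted partitions whose key ranges do not overlap."""
    # Phase 1: every node sorts its own input locally.
    locally_sorted = [sorted(rows) for rows in node_inputs]

    all_keys = [key for rows in locally_sorted for key in rows]
    boundaries = choose_boundaries(all_keys, num_partitions)

    # Shuffle: send each key to the partition that owns its key range.
    partitions = [[] for _ in range(num_partitions)]
    for rows in locally_sorted:
        for key in rows:
            target = bisect.bisect_right(boundaries, key)
            partitions[target].append(key)

    # Phase 2: every node sorts the partition it received, locally.
    return [sorted(part) for part in partitions]


if __name__ == "__main__":
    parts = range_partitioned_sort([[5, 3, 9], [1, 8, 2], [7, 4, 6]], 3)
    # Reading the partitions in range order yields the global order.
    print([key for part in parts for key in part])
```

Because every key in one partition is no larger than any key in the next,
no single node has to merge the whole data set; concatenating the partitions
in range order is enough.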
