Please correct me if I am wrong.

Hive : converts a query to MapReduce job(s). Can work on large-scale data irrespective of the size of the result set.

Impala : runs daemons on all data nodes to get results; no MapReduce job is launched. Good for queries with a small result set.

Tajo : converts a query to Map Reduce 2 job(s). Smarter in terms of the query plans generated and the physical operators selected, both based on cluster characteristics.

On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:

> I'm sorry to send this mail again.
> I cannot understand why the lower part of the above mail is regarded as a
> signature.
> =====================================================
>
> Hi, Tejas
>
> The key difference between Tajo and Impala is the design goal. To increase
> the performance of query processing, Impala adopts an approach in which
> main memory is used as much as possible and intermediate data are
> transferred via streaming. If a query requires too much memory, Impala
> cannot process the query. Thus, Impala says that it is not an alternative
> to Hive.
>
> However, Tajo uses a query optimizer that considers user queries,
> characteristics of the data, the status of the cluster, and so on. Thus,
> Tajo can process a query with Impala's algorithm, Hive's algorithm, or any
> other algorithm. For example, Tajo can process a join query using the
> repartition join or the merge join. Intermediate results can be
> materialized to disk or kept in memory. Since Tajo builds a query plan
> considering the various factors mentioned above, it can always process
> user queries. So, we can say that Tajo can be an alternative to Hive.
>
> Tajo can outperform Hive for most queries. The key reason is that Tajo
> uses its own query engine while Hive uses MapReduce. This limits Hive to
> MapReduce-based algorithms, whereas Tajo can use a more optimized
> algorithm.
>
> A sort query is a good example. Hive supports only hash partitioning.
> Thus, each node sorts data locally in the map phase and *ONE NODE* must
> perform the global sort in the reduce phase.
> However, Tajo supports a sort algorithm using range partitioning. In the
> first phase, each node sorts data locally as in Hive, but the intermediate
> data are partitioned by the range of the sort key. In the second phase,
> each node performs a local sort to get the final results. Since the
> intermediate data are partitioned by the range of the sort key, the final
> results are in the correct global order.
>
> If you have any questions about this,
> please feel free to ask.
>
> Thanks,
> Jihoon
>
> 2013/5/26 Tejas Patil <[email protected]>
>
> > Hi @dev,
> >
> > Can anyone comment on the difference between Tajo, Hive and Impala?
> > Also, what is the reason Tajo performs better than Hive? In what
> > scenario would it be good to use Tajo, and when would it be bad?
> >
> > Thanks,
> > Tejas Patil
> > http://www.linkedin.com/in/tejaspatil1
>
> --
> Jihoon Son
>
> Database & Information Systems Group,
> Prof. Yon Dohn Chung Lab.
> Dept. of Computer Science & Engineering,
> Korea University
> 1, 5-ga, Anam-dong, Seongbuk-gu,
> Seoul, 136-713, Republic of Korea
>
> Tel : +82-2-3290-3580
> E-mail : [email protected]
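
For reference, the range-partitioned sort described in the quoted mail can be sketched roughly as below. This is a minimal single-process Python illustration, not Tajo's actual implementation: the function names, the use of a random sample to choose the range boundaries, and the lists standing in for nodes are all assumptions made for the example.

```python
import bisect
import random


def pick_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of sort keys.

    Sampling is an assumption of this sketch; the mail does not say how
    the key ranges are chosen.
    """
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition_sort(node_inputs, num_partitions, sample_size=100, seed=0):
    """Sort keys spread over several 'nodes' using range partitioning.

    node_inputs: one list of keys per (hypothetical) node.
    Returns one sorted list per partition; reading the partitions in
    order yields the globally sorted output.
    """
    all_keys = [key for chunk in node_inputs for key in chunk]
    if not all_keys:
        return [[] for _ in range(num_partitions)]

    rng = random.Random(seed)
    sample = rng.sample(all_keys, min(sample_size, len(all_keys)))
    boundaries = pick_boundaries(sample, num_partitions)

    # Phase 1: each node sorts its own data, then sends every key to the
    # partition that owns the key's range.
    partitions = [[] for _ in range(num_partitions)]
    for chunk in node_inputs:
        for key in sorted(chunk):
            partitions[bisect.bisect_right(boundaries, key)].append(key)

    # Phase 2: each partition is sorted independently. No single node
    # has to sort the whole data set.
    return [sorted(part) for part in partitions]


if __name__ == "__main__":
    inputs = [[5, 3, 9, 1], [8, 2, 7], [6, 4, 0]]
    for i, part in enumerate(range_partition_sort(inputs, 3)):
        print(f"partition {i}: {part}")
```

Because every key in partition i is less than or equal to every key in partition i + 1, concatenating the partitions in order gives the global sort order without a final single-node sort. A real engine would more likely merge the pre-sorted runs in the second phase than re-sort them; the plain `sorted` call here is only for brevity.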
