Until 11 Jun, I also cannot spend much time on this because of my Ph.D. oral defense =) Here are a couple of things that may interest you.

Yarn-related parts and the DAG framework refactoring
http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCAM=XDd9DtUJ-JKXuMyVA4SMzWiXXNRZg29G=vx73nohxkyo...@mail.gmail.com%3E

Roadmap of Tajo
http://wiki.apache.org/tajo/Roadmap

Cost-based Optimizer for Tajo
https://issues.apache.org/jira/browse/TAJO-24

Best regards,
Hyunsik

On Tue, May 28, 2013 at 3:25 PM, Tejas Patil <[email protected]> wrote:

> Hi Hyunsik,
>
> Tajo is a very interesting system to me :) Working on an incubation
> project is awesome. I had started peeking into the codebase, but I
> really didn't get ample time to continue with that. My quarter will end
> in 2-3 weeks. Do you have any suggestions for a few Jiras that I could
> play around with?
>
> Thanks,
> Tejas
>
> On Mon, May 27, 2013 at 11:02 PM, Hyunsik Choi <[email protected]> wrote:
>
> > Tejas,
> >
> > If so, Tajo is a very interesting system for you. I already know
> > Asterix, and it was very impressive to me. AsterixDB also looks very
> > interesting. I'll read it. Probably, your ideas that were adopted in
> > AsterixDB can be adopted in Tajo.
> >
> > I attach two links for the Tajo paper [1] and poster [2]. I hope that
> > you are interested in them.
> >
> > [1] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_ICDE_2013.pdf
> > [2] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_Poster_ICDE_2013.png
> >
> > Thanks,
> > Hyunsik
> >
> > On Tue, May 28, 2013 at 2:44 PM, Tejas Patil <[email protected]> wrote:
> >
> > > Thanks Hyunsik and Owen.
> > >
> > > The DAG based approach of representing query plans is quite aligned
> > > with the system I have been working on as a part of my current study
> > > at UC, Irvine with Prof Mike Carey: AsterixDb [0]
> > >
> > > [0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf
> > >
> > > On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <[email protected]> wrote:
> > >
> > > > On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <[email protected]> wrote:
> > > >
> > > > > Please correct me if I am wrong.
> > > > >
> > > > > Hive : converts query to Map Reduce job(s). Can work on large
> > > > > scale data irrespective of the size of result set.
> > > >
> > > > Hive will continue to support MapReduce, but it will also get
> > > > support for Tez. Tez is an Apache project that is building an
> > > > execution engine that runs under Yarn. By running under Tez,
> > > > instead of MapReduce, Hive will gain:
> > > > * Use one job instead of many and thus not let go of resources
> > > >   before the query is done
> > > > * Remove the hard synchronization barrier between jobs
> > > > * Allow Hive to shuffle from memory instead of hard disk
> > > >
> > > > > Impala : runs daemons across all data nodes to get results. no
> > > > > map-reduce job is launched. Good for queries with small result
> > > > > set.
> > > > > Tajo : converts query to Map Reduce 2 job(s). Smarter in terms
> > > > > of query plans generated and physical operator selection both
> > > > > based on cluster characteristics.
> > > > >
> > > > > On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:
> > > > >
> > > > > > I'm sorry to send this mail again.
> > > > > > I cannot understand why the lower part of the above mail is
> > > > > > regarded as a signature.
> > > > > > =====================================================
> > > > > >
> > > > > > Hi, Tejas
> > > > > >
> > > > > > The key difference between Tajo and Impala is the design
> > > > > > goal. To increase the performance of query processing,
> > > > > > Impala adopts an approach in which main memory is utilized
> > > > > > as much as possible and intermediate data are transferred
> > > > > > via streaming. If a query requires too much memory, Impala
> > > > > > cannot process the query. Thus, Impala says that it is not
> > > > > > an alternative to Hive.
> > > > > >
> > > > > > However, Tajo uses query optimization that considers user
> > > > > > queries, characteristics of the data, the status of the
> > > > > > cluster, and so on. Thus, Tajo can process a query with
> > > > > > Impala's algorithm, Hive's algorithm, or any other
> > > > > > algorithm. For example, Tajo can process a join query using
> > > > > > the repartition join or the merge join. Intermediate results
> > > > > > can be materialized to disk or maintained in memory. Since
> > > > > > Tajo builds a query plan considering the various factors
> > > > > > mentioned above, it can always process user queries. So, we
> > > > > > can say that Tajo can be an alternative to Hive.
> > > > > >
> > > > > > Tajo can perform better than Hive for most queries. The key
> > > > > > reason is that Tajo uses its own query engine while Hive
> > > > > > uses MapReduce. This means that Hive can use only
> > > > > > MapReduce-based algorithms, whereas Tajo can use a more
> > > > > > optimized algorithm.
> > > > > >
> > > > > > A sort query is a good example. Hive supports only hash
> > > > > > partitioning. Thus, each node sorts data locally in the map
> > > > > > phase and *ONE NODE* must perform the global sort in the
> > > > > > reduce phase.
> > > > > > However, Tajo supports a sort algorithm using range
> > > > > > partitioning. In the first phase, each node sorts data
> > > > > > locally as in Hive, but the intermediate data are
> > > > > > partitioned by the range of the sort key. In the second
> > > > > > phase, each node performs a local sort to get the final
> > > > > > results. Since the intermediate data are partitioned by the
> > > > > > range of the sort key, the final results are correct.
> > > > > >
> > > > > > If you have any questions about this,
> > > > > > please feel free to ask.
> > > > > >
> > > > > > Thanks,
> > > > > > Jihoon
> > > > > >
> > > > > > 2013/5/26 Jihoon Son <[email protected]>
> > > > > >
> > > > > > > 2013/5/26 Tejas Patil <[email protected]>
> > > > > > >
> > > > > > > > Hi @dev,
> > > > > > > >
> > > > > > > > Can anyone comment on the difference between Tajo, Hive
> > > > > > > > and Impala? Also, what is the reason for Tajo to perform
> > > > > > > > well over Hive? In what scenario would it be good to use
> > > > > > > > Tajo, and when would it be bad?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Tejas Patil
> > > > > > > > http://www.linkedin.com/in/tejaspatil1
> > > > > >
> > > > > > --
> > > > > > Jihoon Son
> > > > > >
> > > > > > Database & Information Systems Group,
> > > > > > Prof. Yon Dohn Chung Lab.
> > > > > > Dept. of Computer Science & Engineering,
> > > > > > Korea University
> > > > > > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > > > > > Seoul, 136-713, Republic of Korea
> > > > > >
> > > > > > Tel : +82-2-3290-3580
> > > > > > E-mail : [email protected]
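
The two-phase sort with range partitioning that Jihoon describes in the quoted mail can be sketched roughly as below. This is only an illustration in Python, not Tajo's actual code: the function name `range_partition_sort`, the in-memory lists standing in for nodes, and the way split points are chosen (from all keys, where a real engine would presumably sample) are all assumptions made for the example.

```python
from bisect import bisect_right


def range_partition_sort(node_rows, num_partitions):
    """Hypothetical sketch: local sort, range shuffle, local sort."""
    # Phase 1: every node sorts its own rows locally.
    locally_sorted = [sorted(rows) for rows in node_rows]

    # Choose split points over the sort key. Assumption: we look at all
    # keys here for simplicity instead of sampling them.
    all_keys = sorted(k for rows in locally_sorted for k in rows)
    step = max(1, len(all_keys) // num_partitions)
    boundaries = all_keys[step::step][:num_partitions - 1]

    # Shuffle: route each key to the partition that owns its key range.
    partitions = [[] for _ in range(num_partitions)]
    for rows in locally_sorted:
        for key in rows:
            partitions[bisect_right(boundaries, key)].append(key)

    # Phase 2: each partition is sorted locally. Because partition i only
    # holds keys below those of partition i + 1, reading the partitions
    # in order yields a globally sorted result without a single-node merge.
    return [sorted(p) for p in partitions]


if __name__ == "__main__":
    nodes = [[5, 1, 9], [3, 8, 2], [7, 4, 6]]
    for i, part in enumerate(range_partition_sort(nodes, 3)):
        print(i, part)
```

With hash partitioning, by contrast, keys from the whole key range land in every partition, so the partitions cannot simply be concatenated and a final global sort on one node is needed.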
