Thanks. Now that you have confirmed Tajo is also good for Small Data (one machine), I will definitely try Apache Tajo in my use cases :)
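
As a rough sanity check for the one-machine case, here is a minimal sketch that turns the scan figures quoted below (8 machines with 1 disk each, 80 GB in about 70 seconds) into a per-disk throughput and a single-node estimate. It assumes the linear scaling Jihoon describes and decimal gigabytes; the function names are illustrative only and nothing here is measured on a real Tajo cluster.

```python
# Back-of-envelope scan estimate from the figures quoted in this thread:
# 8 machines x 1 disk each read 80 GB in about 70 seconds.
GB = 1000 ** 3  # assumption: decimal gigabytes (the thread does not say)


def per_disk_throughput(total_bytes, seconds, machines, disks_per_machine):
    """Average bytes/second delivered by each disk in the quoted run."""
    return total_bytes / seconds / (machines * disks_per_machine)


def estimated_scan_seconds(total_bytes, machines, disks_per_machine, disk_bps):
    """Full-scan time, assuming throughput scales linearly with disk count."""
    return total_bytes / (machines * disks_per_machine * disk_bps)


disk_bps = per_disk_throughput(80 * GB, 70, machines=8, disks_per_machine=1)
one_node = estimated_scan_seconds(80 * GB, 1, 1, disk_bps)

print(f"per-disk throughput: {disk_bps / 1e6:.0f} MB/s")
print(f"80 GB on 1 machine with 1 disk: {one_node:.0f} s")
```

Under those assumptions this works out to roughly 143 MB/s per disk and about 560 seconds for a full 80 GB scan on a single one-disk machine. Real numbers will depend on file format, caching and query complexity.
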
Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Tue, Feb 3, 2015 at 8:37 PM, Jihoon Son <[email protected]> wrote:

> Here are my answers.
>
> 1. Are there some benchmarks you could link showing performance
> comparisons with regular RDBMSs?
>
> You can expect great performance for analytic workloads (mostly
> consisting of full table scans and highly complex computations). SK
> Telecom, which is the biggest telco in Korea, has successfully replaced
> their RDBMS-based DW system with Tajo. Below, I have added some detailed
> results of performance evaluations.
>
> I attached a file that contains the results of a comparison with
> PostgreSQL. The experiments were conducted by a student for his thesis
> (please refer to
> http://markmail.org/message/ox7qwe7ojizxwjak#query:+page:1+mid:l3kdaoomyg36abfc+state:results
> ).
> We also have internal results against MySQL and Oracle. Unfortunately,
> I cannot share these results, but Tajo was generally tens of times faster
> than them for analytic workloads.
> Owing to index scans, RDBMSs are much faster than Tajo when reading few
> rows. However, we are also preparing index support. Here are the
> related issue (https://issues.apache.org/jira/browse/TAJO-1300) and
> documentation (http://tajo.apache.org/docs/devel/index_overview.html).
>
> Additionally, here are performance comparison results with other
> SQL-on-Hadoop systems.
> Please refer to
> http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space/
> .
>
> 2. Following off my first question, I see you recommending it for data
> warehousing... what about for regular operations, what kinds of latency can
> I expect?
>
> With 8 machines, each equipped with 1 disk, it takes about 70
> seconds to read 80 GB of data. This scales linearly as you add more
> disks per machine or more machines.
> For more complex queries, I hope that the above links will help you. If
> you need more, please feel free to ask us.
>
> 3. Does Tajo scale from 1 node to huge clusters, or does it have a larger
> minimum requirement?
>
> As Hyunsik said, Tajo can scale from 1 node to huge clusters. SK Telecom
> already operates a cluster of hundreds of machines.
>
> Best Regards,
> Jihoon
>
> On Tue Feb 03 2015 at 5:27:20 PM Samuel Marks <[email protected]>
> wrote:
>
>> Thanks Jihoon,
>>
>> Are there some benchmarks you could link showing performance comparisons
>> with regular RDBMSs?
>>
>> Does Tajo scale from 1 node to huge clusters, or does it have a larger
>> minimum requirement?
>>
>> Following off my first question, I see you recommending it for data
>> warehousing... what about for regular operations, what kinds of latency can
>> I expect?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>> On 03/02/2015 6:43 pm, "Jihoon Son" <[email protected]> wrote:
>>
>>> Hi Samuel, sorry for the late response.
>>> I'm Jihoon Son, a PMC member of Apache Tajo.
>>>
>>> Of course, I prefer Tajo.
>>> This is not only because I'm working on it, but also because it is
>>> really cool.
>>>
>>> Tajo was originally designed as a data warehouse system. So, it can
>>> provide a good analysis experience to users with low latency and high
>>> throughput.
>>> Thanks to its fault-tolerant and SQL-oriented nature, it can conduct
>>> interactive ad-hoc analysis as well as batch execution efficiently.
>>> Another representative advantage of Tajo is its rich SQL support.
>>> It "currently" supports most of the popular features of SQL.
>>> Unfortunately, it does not support transactions.
>>>
>>> So, if you need a data warehouse system, please consider Tajo as a
>>> candidate.
>>>
>>> Sincerely,
>>> Jihoon
>>>
>>> 2015-02-02 9:54 GMT+09:00 Samuel Marks <[email protected]>:
>>>
>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>> various commercial and/or open-source attempts to expose some
>>>> compatibility with SQL <http://drill.apache.org>.
>>>> Obviously by posting here I am not expecting an unbiased answer.
>>>>
>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>>>> and supports the most common CRUD <https://spark.apache.org>,
>>>> including [the basics!] along these lines: CREATE TABLE, INSERT INTO,
>>>> SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>> Transactional support would be nice also, but is not a must-have.
>>>>
>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>
>>>> Python is my language of choice for interfacing, however there does
>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>
>>>> Here is what I've found thus far:
>>>>
>>>> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>>   SQL thanks to the Stinger initiative)
>>>> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>> - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>   <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>>   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>   or Parquet <http://parquet.io/>)
>>>> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>   HBase <http://hbase.apache.org>, lacks full transaction
>>>>   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>   operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>   some built-in functions)
>>>> - Cloudera Impala
>>>>   <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>   (significant HiveQL support, some SQL language support, no support for
>>>>   indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>   amongst others)
>>>> - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>   query Hive, Cassandra <http://cassandra.apache.org>, relational DBs
>>>>   &etc.
>>>>   Doesn't seem to be designed for low-latency responses across small
>>>>   clusters, or support UPDATE operations. It is optimized for data
>>>>   warehousing or analytics¹
>>>>   <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>   community edition <https://www.mapr.com/products/hadoop-download>
>>>>   (seems to be a packaging of Hive, HP Vertica
>>>>   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>>   Drill and a native ODBC wrapper
>>>>   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>   interface and multi-dimensional analysis [OLAP
>>>>   <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>>   and supports most ANSI SQL query functions". It depends on HDFS,
>>>>   MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>   though maintains low query latency)
>>>> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>   compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>   support [benchmarks against Hive and Impala
>>>>   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>   ])
>>>> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>   Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>   provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>   publishing files [or any resource] as schemas and tables.")
>>>>
>>>> Which, from this list or elsewhere, would you recommend, and why?
>>>> Thanks for all suggestions,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
