Thanks Jihoon,

Are there any benchmarks you could link showing performance comparisons with regular RDBMSs?
Does Tajo scale from 1 node to huge clusters, or does it have a larger minimum requirement?

Following on from my first question: I see you recommend it for data warehousing. What about regular operations? What kind of latency can I expect?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On 03/02/2015 6:43 pm, "Jihoon Son" <[email protected]> wrote:

> Hi Samuel, sorry for the late response.
> I'm Jihoon Son, a PMC member of Apache Tajo.
>
> Of course, I prefer Tajo.
> This is not only because I'm working on it, but also because it is really
> cool.
>
> Tajo was originally designed as a data warehouse system, so it can provide
> a good analysis experience to users with low latency and high throughput.
> Thanks to its fault-tolerant and SQL-oriented nature, it can conduct
> interactive ad-hoc analysis as well as batch execution efficiently.
> Another notable advantage of Tajo is its rich SQL support.
> It "currently" supports most of the popular features of SQL.
> Unfortunately, it does not support transactions.
>
> So, if you need a data warehouse system, please consider Tajo as a
> candidate.
>
> Sincerely,
> Jihoon
>
> 2015-02-02 9:54 GMT+09:00 Samuel Marks <[email protected]>:
>
>> Since Hadoop <https://hive.apache.org> came out, there have been various
>> commercial and/or open-source attempts to expose some compatibility with
>> SQL <http://drill.apache.org>. Obviously by posting here I am not
>> expecting an unbiased answer.
>>
>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>> and supports the most common CRUD <https://spark.apache.org>, including
>> [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM,
>> UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>> support would be nice also, but is not a must-have.
>>
>> Essentially I want a full replacement for the more traditional RDBMS, one
>> which can scale from 1 node to a serious Hadoop cluster.
>>
>> Python is my language of choice for interfacing, however there does seem
>> to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>
>> Here is what I've found thus far:
>>
>>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>    SQL thanks to the Stinger initiative)
>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>    or Parquet <http://parquet.io/>)
>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>>    <http://hbase.apache.org>, lacks full transaction
>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>    some built-in functions)
>>    - Cloudera Impala
>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>    (significant HiveQL support, some SQL language support, no support for
>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT,
>>    amongst others)
>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>    query Hive, Cassandra <http://cassandra.apache.org>, relational DBs
>>    etc. Doesn't seem to be designed for low-latency responses across small
>>    clusters, or support UPDATE operations. It is optimized for data
>>    warehousing or analytics¹
>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>    community edition <https://www.mapr.com/products/hadoop-download>
>>    (seems to be a packaging of Hive, HP Vertica
>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>    Drill and a native ODBC wrapper
>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>    interface and multi-dimensional analysis [OLAP
>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>    though maintains low query latency)
>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>    support [benchmarks against Hive and Impala
>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>    ])
>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>    publishing files [or any resource] as schemas and tables.")
>>
>> Which, from this list or elsewhere, would you recommend, and why?
>> Thanks for all suggestions,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
