Thanks. Doesn't Drill have SQL:2003 support? Phoenix does seem good; my main reason for not jumping to it immediately is its additional degree of indirection (HBase, which IIRC Splice also has).
And although most of these are analytical databases, that doesn't necessitate high latency. It may require a decent cluster, though, which is definitely worth considering (I need to scale from 1 node to many).

Cheers,

Samuel Marks
http://linkedin.com/in/samuelmarks

On 31 Jan 2015 03:57, "Vladimir Rodionov" <[email protected]> wrote:

> Or SpliceDB (not open-source, though), but it provides full TX, ANSI
> SQL-99 support and can run TPC-C/TPC-H in full.
>
> Disclaimer: I work for Splice Machine.
>
> -Vlad
>
> On Fri, Jan 30, 2015 at 8:25 AM, Vladimir Rodionov <[email protected]>
> wrote:
>
>> I think Phoenix is the only option you have. All other products (projects)
>> are analytical databases (or OLAP). If you need record-level operation
>> support and indexes - Phoenix.
>>
>> -Vlad
>>
>> On Fri, Jan 30, 2015 at 3:26 AM, Samuel Marks <[email protected]>
>> wrote:
>>
>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>> various commercial and/or open-source attempts to expose some
>>> compatibility with SQL <http://drill.apache.org>. Obviously by posting
>>> here I am not expecting an unbiased answer.
>>>
>>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>>> and supports the most common CRUD <https://spark.apache.org>, including
>>> [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT *
>>> FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>> Transactional support would be nice also, but is not a must-have.
>>>
>>> Essentially I want a full replacement for the more traditional RDBMS,
>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>
>>> Python is my language of choice for interfacing; however, there does
>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>
>>> Here is what I've found thus far:
>>>
>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>    SQL thanks to the Stinger initiative)
>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>    or Parquet <http://parquet.io/>)
>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>>>    <http://hbase.apache.org>, lacks full transaction
>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>    some built-in functions)
>>>    - Cloudera Impala
>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>    (significant HiveQL support, some SQL language support, no support for
>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT,
>>>    amongst others)
>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational DBs
>>>    etc. Doesn't seem to be designed for low-latency responses across small
>>>    clusters, or support UPDATE operations. It is optimized for data
>>>    warehousing or analytics¹
>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>    (seems to be a packaging of Hive, HP Vertica
>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>    Drill and a native ODBC wrapper
>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>    interface and multi-dimensional analysis [OLAP
>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>>    MapReduce, Hive and HBase, and seems targeted at very large data-sets
>>>    though maintains low query latency)
>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>    support [benchmarks against Hive and Impala
>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>    ])
>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>    publishing files [or any resource] as schemas and tables.")
>>>
>>> Which, from this list or elsewhere, would you recommend, and why?
>>> Thanks for all suggestions,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
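
P.S. For anyone wanting to check a candidate against the CRUD checklist from the original post, here is a rough sketch of a smoke test. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; the table name `t`, the column names and the helpers `crud_smoke_test` and `unsupported` are all made up for illustration. The assumption is that the engine under test can be reached through some Python DB-API 2.0 connection (for example via a JDBC wrapper), which would need verifying per engine, as would the exact DDL each one accepts.

```python
import sqlite3

# One statement per operation named in the original post.
# Table and column names are hypothetical.
CRUD_STATEMENTS = [
    "CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)",
    "INSERT INTO t (id, c1) VALUES (1, 1)",
    "UPDATE t SET c1 = 2 WHERE id = 1",
    "SELECT * FROM t",
    "DELETE FROM t WHERE id = 1",
    "DROP TABLE t",
]


def crud_smoke_test(conn):
    """Run each statement on a DB-API 2.0 connection.

    Returns a dict mapping each statement to None on success,
    or to the error message if the engine rejected it.
    """
    results = {}
    cur = conn.cursor()
    for stmt in CRUD_STATEMENTS:
        try:
            cur.execute(stmt)
            results[stmt] = None
        except Exception as exc:  # each driver raises its own Error subclass
            results[stmt] = str(exc)
    conn.commit()
    cur.close()
    return results


def unsupported(results):
    """List the statements that failed."""
    return [stmt for stmt, err in results.items() if err is not None]


if __name__ == "__main__":
    # sqlite3 is only a stand-in; swap in a connection to the engine under test.
    report = crud_smoke_test(sqlite3.connect(":memory:"))
    for stmt, err in report.items():
        print("OK  " if err is None else "FAIL", stmt)
```

Passing this says nothing about latency or transactions; it only shows which of the basic statements an engine accepts at all.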
