Re: Which [open-source] SQL engine atop Hadoop?

Samuel Marks Tue, 03 Feb 2015 00:27:54 -0800

Thanks Jihoon,

Are there some benchmarks you could link showing performance comparisons
with regular RDBMSs?


Does Tajo scale from 1 node to huge clusters, or does it have a larger
minimum requirement?

Following on from my first question: I see you recommend it for data
warehousing, but what about regular operational queries? What kind of latency
can I expect?
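
To make the question concrete, here is a rough sketch of the kind of check I have in mind: run the basic CRUD statements from my original post through a Python DB-API connection and time each one. Note that sqlite3 is only a stand-in so the snippet runs anywhere; the table name `t`, the `STATEMENTS` list and the `crud_smoke_test` helper are my own placeholders, not anything from Tajo or its drivers.

```python
import sqlite3
import time

# The basic CRUD statements from the original post, in order.
# Table and column names are placeholders for illustration only.
STATEMENTS = [
    ("CREATE TABLE", "CREATE TABLE t (id INTEGER, c1 INTEGER)"),
    ("INSERT INTO", "INSERT INTO t VALUES (1, 1)"),
    ("SELECT", "SELECT * FROM t"),
    ("UPDATE", "UPDATE t SET c1 = 2 WHERE id = 1"),
    ("DELETE FROM", "DELETE FROM t WHERE id = 1"),
    ("DROP TABLE", "DROP TABLE t"),
]


def crud_smoke_test(conn):
    """Run each statement on a DB-API connection.

    Returns {statement name: (supported, elapsed seconds)}.
    """
    results = {}
    for name, sql in STATEMENTS:
        cur = conn.cursor()
        start = time.perf_counter()
        try:
            cur.execute(sql)
            if name == "SELECT":
                cur.fetchall()  # include result transfer in the timing
            ok = True
        except Exception:
            # Treat any driver error as "statement not supported".
            ok = False
        results[name] = (ok, time.perf_counter() - start)
        cur.close()
    return results


if __name__ == "__main__":
    # Assumption: sqlite3 stands in for the real engine's connection.
    conn = sqlite3.connect(":memory:")
    for name, (ok, secs) in crud_smoke_test(conn).items():
        status = "ok" if ok else "unsupported"
        print(f"{name:12} {status:12} {secs * 1000:.3f} ms")
```

Swapping the sqlite3 connection for one obtained through a JDBC wrapper should, in principle, give a per-statement support and latency table for each engine on the list.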

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks
On 03/02/2015 6:43 pm, "Jihoon Son" <[email protected]> wrote:

> Hi Samuel, sorry for the late response.
> I'm Jihoon Son, a PMC member of Apache Tajo.
>
> Of course, I prefer Tajo,
> not only because I work on it, but also because it is really cool.
>
> Tajo was originally designed as a data warehouse system, so it can provide
> a good analysis experience to users, with low latency and high throughput.
> Thanks to its fault-tolerant and SQL-oriented nature, it can handle
> interactive ad-hoc analysis as well as batch execution efficiently.
> Another notable advantage of Tajo is its rich SQL support:
> it "currently" supports most of the popular features of SQL.
> Unfortunately, it does not support transactions.
>
> So, if you need a data warehouse system, please consider Tajo as a
> candidate.
>
> Sincerely,
> Jihoon
>
> 2015-02-02 9:54 GMT+09:00 Samuel Marks <[email protected]>:
>
>> Since Hadoop <https://hive.apache.org> came out, there have been various
>> commercial and/or open-source attempts to expose some compatibility with
>> SQL <http://drill.apache.org>. Obviously by posting here I am not
>> expecting an unbiased answer.
>>
>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>> and supports the most common CRUD <https://spark.apache.org>, including
>> [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM,
>> UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>> support would be nice also, but is not a must-have.
>>
>> Essentially I want a full replacement for the more traditional RDBMS, one
>> which can scale from 1 node to a serious Hadoop cluster.
>>
>> Python is my language of choice for interfacing, however there does seem
>> to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>
>> Here is what I've found thus far:
>>
>>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>    SQL thanks to the Stinger initiative)
>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>    or Parquet <http://parquet.io/>)
>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>>    <http://hbase.apache.org>, lacks full transaction
>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>    some built-in functions)
>>    - Cloudera Impala
>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>    (significant HiveQL support, some SQL language support, no support for
>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>    amongst others)
>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>    query Hive, Cassandra <http://cassandra.apache.org>, relational DBs
>>    etc. Doesn't seem to be designed for low-latency responses across small
>>    clusters, or support UPDATE operations. It is optimized for data
>>    warehousing or analytics¹
>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>    community edition <https://www.mapr.com/products/hadoop-download>
>>    (seems to be a packaging of Hive, HP Vertica
>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>    Drill and a native ODBC wrapper
>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>    interface and multi-dimensional analysis [OLAP
>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>    MapReduce, Hive and HBase, and seems targeted at very large data-sets,
>>    though it maintains low query latency)
>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>    support [benchmarks against Hive and Impala
>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>    publishing files [or any resource] as schemas and tables.")
>>
>> Which—from this list or elsewhere—would you recommend, and why?
>> Thanks for all suggestions,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>
>
