Re: Which [open-source] SQL engine atop Hadoop?

Hyunsik Choi Tue, 03 Feb 2015 01:19:36 -0800

Hi Samuel,

Tajo scales from a single node to very large clusters; there is no
minimum cluster size. Some users run a single Tajo node for their
analytics, and Tajo is efficient and fast enough to run on one node.
According to internal tests, a single Tajo node outperforms MySQL by
up to tens of times on analytical workloads. One company provides a
desktop version at http://www.gruter.com/download.html. Of course,
MySQL is faster than Tajo for low-selectivity queries (those that
return only a small fraction of the rows) that can exploit indexes.
However, there are no official benchmark results that compare Tajo
with regular RDBMSs. The two systems are designed for different
workloads, so in my opinion it may be hard to compare them fairly.


Nevertheless, here is one benchmark that some users carried out:
http://markmail.org/message/ox7qwe7ojizxwjak#query:+page:1+mid:l3kdaoomyg36abfc+state:results

Best regards,
Hyunsik

On Tue, Feb 3, 2015 at 12:25 AM, Samuel Marks <[email protected]> wrote:
> Thanks Jihoon,
>
> Are there some benchmarks you could link showing performance comparisons
> with regular RDBMSs?
>
> Does Tajo scale from 1 node to huge clusters, or does it have a larger
> minimum requirement?
>
> Following off my first question, I see you recommending it for data
> warehousing... what about for regular operations, what kinds of latency can
> I expect?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On 03/02/2015 6:43 pm, "Jihoon Son" <[email protected]> wrote:
>>
>> Hi Samuel, sorry for the late response.
>> I'm Jihoon Son, a PMC member of Apache Tajo.
>>
>> Of course, I prefer Tajo.
>> This is not only because I'm working on it, but also because it is
>> really cool.
>>
>> Tajo was originally designed as a data warehouse system, so it can provide
>> a good analysis experience to users with low latency and high throughput.
>> Thanks to its fault-tolerant and SQL-oriented nature, it can conduct
>> interactive ad-hoc analysis as well as batch execution efficiently.
>> An additional representative advantage of Tajo is the rich SQL support.
>> It "currently" supports most of the popular features of SQL.
>> Meanwhile, unfortunately, it does not support transactions.
>>
>> So, if you need a data warehouse system, please consider Tajo as a
>> candidate.
>>
>> Sincerely,
>> Jihoon
>>
>> 2015-02-02 9:54 GMT+09:00 Samuel Marks <[email protected]>:
>>>
>>> Since Hadoop came out, there have been various commercial and/or
>>> open-source attempts to expose some compatibility with SQL. Obviously by
>>> posting here I am not expecting an unbiased answer.
>>>
>>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>>> and supports the most common CRUD, including [the basics!] along these
>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2
>>> WHERE, DELETE FROM, and DROP TABLE. Transactional support would be nice
>>> also, but is not a must-have.
>>>
>>> Essentially I want a full replacement for the more traditional RDBMS, one
>>> which can scale from 1 node to a serious Hadoop cluster.
>>>
>>> Python is my language of choice for interfacing, however there does seem
>>> to be a Python JDBC wrapper.
>>>
>>> Here is what I've found thus far:
>>>
>>> Apache Hive (SQL-like, with interactive SQL thanks to the Stinger
>>> initiative)
>>> Apache Drill (ANSI SQL support)
>>> Apache Spark (Spark SQL, queries only, add data via Hive, RDD or
>>> Parquet)
>>> Apache Phoenix (built atop Apache HBase, lacks full transaction support,
>>> relational operators and some built-in functions)
>>> Cloudera Impala (significant HiveQL support, some SQL language support,
>>> no support for indexes on its tables, importantly missing DELETE, UPDATE and
>>> INTERSECT; amongst others)
>>> Presto from Facebook (can query Hive, Cassandra, relational DBs, etc.
>>> Doesn't seem to be designed for low-latency responses across small clusters,
>>> or support UPDATE operations. It is optimized for data warehousing or
>>> analytics¹)
>>> SQL-Hadoop via MapR community edition (seems to be a packaging of Hive,
>>> HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
>>> Apache Kylin from eBay (provides an SQL interface and multi-dimensional
>>> analysis [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL
>>> query functions". It depends on HDFS, MapReduce, Hive and HBase; and seems
>>> targeted at very large data-sets though maintains low query latency)
>>> Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support
>>> [benchmarks against Hive and Impala])
>>> Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command
>>> shell, and a catalog manager for publishing files [or any resource] as
>>> schemas and tables.")
>>>
>>> Which—from this list or elsewhere—would you recommend, and why?
>>>
>>> Thanks for all suggestions,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>
>>
>
