Re: Which [open-source] SQL engine atop Hadoop?

Samuel Marks Wed, 04 Feb 2015 22:08:44 -0800

Thanks. Now that you have confirmed Tajo is also good for Small Data (one
machine), I will definitely try Apache Tajo in my use-cases :)


Best,


Samuel Marks
http://linkedin.com/in/samuelmarks

On Tue, Feb 3, 2015 at 8:37 PM, Jihoon Son <[email protected]> wrote:

> Here are my answers.
>
> 1. Are there some benchmarks you could link showing performance
> comparisons with regular RDBMSs?
>
> You can expect great performance for analytic workloads (mostly
> consisting of full table scans and highly complex computations). SK
> Telecom, which is one of the biggest telcos in Korea, has successfully
> replaced their RDBMS-based DW system with Tajo. Below, I have added some
> detailed performance evaluation results.
>
> I attached a file that contains the results of a comparison with PostgreSQL.
> The experiments were conducted by a student for his thesis (please refer to
> http://markmail.org/message/ox7qwe7ojizxwjak#query:+page:1+mid:l3kdaoomyg36abfc+state:results
> ).
> We also have internal results against MySQL and Oracle. Unfortunately,
> I cannot share these results, but Tajo was generally tens of times faster
> than them for analytic workloads.
> Thanks to index scans, RDBMSs are much faster than Tajo when reading only a
> few rows. However, we are also working on index support. Here are the
> related issue (https://issues.apache.org/jira/browse/TAJO-1300) and
> documentation (http://tajo.apache.org/docs/devel/index_overview.html).
>
> Additionally, here are performance comparison results with other
> SQL-on-Hadoop systems. Please refer to
> http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space/
>
> 2. Following off my first question, I see you recommending it for data
> warehousing... what about for regular operations, what kinds of latency can
> I expect?
>
> With 8 machines, each equipped with 1 disk, it takes about 70 seconds to
> read 80 GB of data. Read throughput should increase roughly linearly as you
> add more disks per machine or more machines.
> For more complex queries, I hope that the above links will help you. If
> you need more, please feel free to ask us.
>
> 3. Does Tajo scale from 1 node to huge clusters, or does it have a larger
> minimum requirement?
> As Hyunsik said, Tajo can scale from 1 node to huge clusters. SK Telecom
> already operates a cluster of hundreds of machines.
>
> Best Regards,
> Jihoon
>
> On Tue Feb 03 2015 at 5:27:20 PM Samuel Marks <[email protected]>
> wrote:
>
>> Thanks Jihoon,
>>
>> Are there some benchmarks you could link showing performance comparisons
>> with regular RDBMSs?
>>
>> Does Tajo scale from 1 node to huge clusters, or does it have a larger
>> minimum requirement?
>>
>> Following off my first question, I see you recommending it for data
>> warehousing... what about for regular operations, what kinds of latency can
>> I expect?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>> On 03/02/2015 6:43 pm, "Jihoon Son" <[email protected]> wrote:
>>
>>> Hi Samuel, sorry for the late response.
>>> I'm Jihoon Son, a PMC member of Apache Tajo.
>>>
>>> Of course, I prefer Tajo.
>>> This is not only because I'm working on it, but also because it is
>>> really cool.
>>>
>>> Tajo was originally designed as a data warehouse system. So, it can
>>> provide a good analysis experience to users with low latency and high
>>> throughput.
>>> Thanks to its fault-tolerant and SQL-oriented nature, it can conduct
>>> interactive ad-hoc analysis as well as batch execution efficiently.
>>> Another notable advantage of Tajo is its rich SQL support.
>>> It "currently" supports most of the popular features of SQL.
>>> Unfortunately, it does not support transactions.
>>>
>>> So, if you need a data warehouse system, please consider Tajo as a
>>> candidate.
>>>
>>> Sincerely,
>>> Jihoon
>>>
>>> 2015-02-02 9:54 GMT+09:00 Samuel Marks <[email protected]>:
>>>
>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>> various commercial and/or open-source attempts to expose some compatibility
>>>> with SQL <http://drill.apache.org>. Obviously by posting here I am not
>>>> expecting an unbiased answer.
>>>>
>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>>>> and supports the most common CRUD <https://spark.apache.org>,
>>>> including [the basics!] along these lines: CREATE TABLE, INSERT INTO,
>>>> SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>> Transactional support would be nice also, but is not a must-have.
>>>>
>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>
>>>> Python is my language of choice for interfacing, however there does
>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>
>>>> Here is what I've found thus far:
>>>>
>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>>    SQL thanks to the Stinger initiative)
>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>    or Parquet <http://parquet.io/>)
>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>    some built-in functions)
>>>>    - Cloudera Impala
>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>    amongst others)
>>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational DBs,
>>>>    etc. Doesn't seem to be designed for low-latency responses across small
>>>>    clusters, or support UPDATE operations. It is optimized for data
>>>>    warehousing or analytics¹
>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>>    (seems to be a packaging of Hive, HP Vertica
>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>>    Drill and a native ODBC wrapper
>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>    interface and multi-dimensional analysis [OLAP
>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>    though it maintains low query latency)
>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>    support [benchmarks against Hive and Impala
>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>    publishing files [or any resource] as schemas and tables.")
>>>>
>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>> Thanks for all suggestions,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>
>>>
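The CRUD checklist in the original post (CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE ... SET ... WHERE, DELETE FROM, DROP TABLE) can be turned into a small smoke test when trialling a candidate engine. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere, not Tajo or any other engine from the list. The table name t, the columns id and c1, and the function name run_crud_checklist are made up for the example.

```python
import sqlite3


def run_crud_checklist(conn):
    """Run the basic statements from the checklist and return the rows
    that remain just before the table is dropped.

    Assumption: `conn` is a DB-API connection whose driver accepts "?"
    placeholders (true for sqlite3; other drivers may differ).
    """
    cur = conn.cursor()
    # CREATE TABLE (hypothetical table `t` with columns `id` and `c1`)
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)")
    # INSERT INTO
    cur.executemany(
        "INSERT INTO t (id, c1) VALUES (?, ?)",
        [(1, 10), (2, 20), (3, 30)],
    )
    # UPDATE ... SET ... WHERE
    cur.execute("UPDATE t SET c1 = 2 WHERE id = 1")
    # DELETE FROM
    cur.execute("DELETE FROM t WHERE id = 3")
    # SELECT * FROM
    cur.execute("SELECT * FROM t ORDER BY id")
    rows = cur.fetchall()
    # DROP TABLE
    cur.execute("DROP TABLE t")
    return rows


# SQLite in-memory database used purely as a stand-in engine.
connection = sqlite3.connect(":memory:")
remaining_rows = run_crud_checklist(connection)
connection.close()
print(remaining_rows)
```

Pointing the same function at another engine would need a DB-API connection for that engine and possibly a different placeholder style. An engine that lacks UPDATE or DELETE, as the list above notes for some entries, would be expected to fail at that step.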

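As a rough reading of the scan figure quoted above (about 70 seconds to read 80 GB on 8 machines with 1 disk each), the arithmetic works out to about 1.14 GB/s across the cluster, or roughly 143 MB/s per disk. The sketch below does that calculation and extrapolates under the linear-scaling assumption stated in the reply. The function names are hypothetical, the 70 seconds is an approximate figure, and the model ignores query startup, network and CPU costs, so treat the output as a back-of-envelope estimate only.

```python
def per_disk_rate_gb_s(data_gb, seconds, machines, disks_per_machine=1):
    """Average scan rate per disk (GB/s) implied by one measured full scan."""
    return data_gb / seconds / (machines * disks_per_machine)


def estimated_scan_seconds(data_gb, machines, disks_per_machine, rate_gb_s):
    """Estimated scan time, assuming throughput grows linearly with the
    total number of disks (the assumption made in the thread)."""
    return data_gb / (machines * disks_per_machine * rate_gb_s)


# Figures quoted in the thread: 80 GB in about 70 s on 8 single-disk machines.
rate = per_disk_rate_gb_s(80, 70, machines=8)
aggregate = rate * 8

# Hypothetical what-if: the same 80 GB on 16 single-disk machines.
doubled_cluster_seconds = estimated_scan_seconds(80, 16, 1, rate)

print(f"aggregate: {aggregate:.2f} GB/s, per disk: {rate * 1000:.0f} MB/s")
print(f"estimated time on 16 machines: {doubled_cluster_seconds:.0f} s")
```

The same two functions can be reused with your own measured scan time to check whether a cluster is in the same range before running more complex queries.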