Re: Which [open-source] SQL engine atop Hadoop?

Koert Kuipers Sat, 31 Jan 2015 08:11:22 -0800

Spark-SQL is read-only, yes, in the sense that it does not support mutation,
only transformation into a new dataset that you store separately.

i am not aware of many systems that support mutation. systems that do
generally do not use plain HDFS files as the datastore, so something like
Phoenix (backed by HBase) will be needed for that.
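
to make the difference concrete, here is a minimal sketch of expressing an
UPDATE and a DELETE as a transformation into a new table instead of a
mutation. it uses python's built-in sqlite3 purely as a stand-in so that it
runs anywhere; it is not spark-sql, and the table and column names (events,
events_v2, c1) are made up for illustration.

```python
import sqlite3

# sqlite3 is only a stand-in engine so the sketch is runnable; it is NOT spark-sql.
# The names events, events_v2 and c1 are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, c1 INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# Mutation style (what a read-only engine will not do):
#   UPDATE events SET c1 = 2 WHERE id = 2
#   DELETE FROM events WHERE id = 3
#
# Transformation style: derive a new dataset and store it separately.
# The CASE expression plays the role of UPDATE, the WHERE filter the role of DELETE.
conn.execute(
    """
    CREATE TABLE events_v2 AS
    SELECT id,
           CASE WHEN id = 2 THEN 2 ELSE c1 END AS c1
    FROM events
    WHERE id <> 3
    """
)

original = conn.execute("SELECT id, c1 FROM events ORDER BY id").fetchall()
derived = conn.execute("SELECT id, c1 FROM events_v2 ORDER BY id").fetchall()
print(original)  # the source table is left untouched
print(derived)   # the derived table carries the changes
```

the source table is never modified; the changes exist only in the derived
one. on hdfs the same pattern would mean writing a new set of files rather
than rewriting existing ones.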

On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <[email protected]> wrote:

> yes you can run whatever you like with the data in hdfs. keep in mind that
> hive makes this general access pattern just a little harder, since hive has
> a tendency to store data and metadata separately, with the metadata in a
> special metadata store (not on hdfs), and it's not as easy for all systems
> to access hive metadata.
>
> i am not familiar at all with tajo or drill.
>
> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <[email protected]>
> wrote:
>
>> Thanks for the advice
>>
>> Koert: when everything is in the same essential data-store (HDFS), can't
>> I just run whatever complex tools in whichever paradigm they like?
>>
>> E.g.: GraphX, Mahout &etc.
>>
>> Also, what about Tajo or Drill?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> PS: Spark-SQL is read-only IIRC, right?
>> On 31 Jan 2015 03:39, "Koert Kuipers" <[email protected]> wrote:
>>
>>> since you require high-powered analytics, and i assume you want to stay
>>> sane while doing so, you require the ability to "drop out of sql" when
>>> needed. so spark-sql and lingual would be my choices.
>>>
>>> low latency indicates phoenix or spark-sql to me.
>>>
>>> so i would say spark-sql
>>>
>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <[email protected]>
>>> wrote:
>>>
>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does
>>>> open-source a lot of software <http://www.pivotal.io/oss>, I don't
>>>> believe they open-source Pivotal HD: HAWQ.
>>>>
>>>> So that doesn't meet my requirements. I should note that the project I
>>>> am building will also be open-source, which heightens the importance of
>>>> all components also being open-source.
>>>>
>>>> Cheers,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>> [email protected]> wrote:
>>>>
>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>>> various commercial and/or open-source attempts to expose some
>>>>> compatibility with SQL <http://drill.apache.org>. Obviously by posting
>>>>> here I am not expecting an unbiased answer.
>>>>>
>>>>> Seeking an SQL-on-Hadoop offering which provides low-latency
>>>>> querying and supports the most common CRUD <https://spark.apache.org>
>>>>> operations, including [the basics!] along these lines: CREATE TABLE,
>>>>> INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM,
>>>>> and DROP TABLE. Transactional support would be nice also, but is not a
>>>>> must-have.
>>>>>
>>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>>
>>>>> Python is my language of choice for interfacing; however, there does
>>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>>
>>>>> Here is what I've found thus far:
>>>>>
>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>    <https://spark.apache.org/sql>, queries only, add data via Hive,
>>>>>    RDD
>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>    or Parquet <http://parquet.io/>)
>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>>    some built-in functions)
>>>>>    - Cloudera Impala
>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and
>>>>>    INTERSECT, amongst others)
>>>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational
>>>>>    DBs &etc. Doesn't seem to be designed for low-latency responses across
>>>>>    small clusters, or support UPDATE operations. It is optimized for
>>>>>    data warehousing or analytics¹
>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>>>    (seems to be a packaging of Hive, HP Vertica
>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>>    interface and multi-dimensional analysis [OLAP
>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>    MapReduce, Hive and HBase, and seems targeted at very large data-sets
>>>>>    though it maintains low query latency)
>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>>    support [benchmarks against Hive and Impala
>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>    ])
>>>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>>    publishing files [or any resource] as schemas and tables.")
>>>>>
>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>> Thanks for all suggestions,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
