RE: Which [open-source] SQL engine atop Hadoop?

Andrew Brust Fri, 30 Jan 2015 15:38:52 -0800

Not sure Drill -- or any of the other SQL-on-Hadoop engines -- is truly
well-suited to CRUD.  They excel at the "R" -- the "CUD" is not their forte.


-----Original Message-----
From: Samuel Marks [mailto:[email protected]] 
Sent: Friday, January 30, 2015 8:50 AM
To: [email protected]
Subject: Re: Which [open-source] SQL engine atop Hadoop?

Dear Jacques,

Seeing the support for SQL:2003 syntax, nested objects, and schema-free SQL in
Apache Drill is quite impressive, not to mention the useful ODBC interface
alongside the expected JDBC one. Additionally, on the scalability side, your
documentation claims: "Scales from a single laptop to a 1000-node cluster".

You mention that this entire topic is subjective. I suppose with insufficient 
information about my use-case, you may just be right.

Without giving away my full use-case—FYI: I will be open-sourcing what I'm 
building—I will tell you a little bit about the components.

The generic components would just include CRUD, and basic related queries (such 
as propagated updates utilising joins).

More interesting is the analytics side, where I'll be running a variety of
machine learning, information filtering (recommender systems and an internal
search engine, most with some element of natural language processing),
time-series sequence matching, and related tasks. Some of these require
near-realtime responses, whereas others can be delayed significantly.

I posted something similar to this on StackOverflow, but it was very quickly
removed. Haven't tried LinkedIn or Quora, probably worth a shot. Worried about 
speaking to enterprise sales people, as they're being paid to push their own 
offering (and I doubt they have extensive benchmarks across all their 
competitors).

Thanks for your continuing advice,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sat, Jan 31, 2015 at 12:22 AM, Jacques Nadeau <[email protected]> wrote:

> Samuel,
>
> You've come and asked your question on the Apache Drill group so of 
> course the answer is Apache Drill is best for everything, right?
>
> The reality is that each tool has a set of strengths and weaknesses 
> for each particular use case. An Apache user support mailing list is 
> definitely NOT the place to have this discussion.  You're really 
> asking for technology selection advice and this entire topic is very 
> subjective. The people in any one community would never do full 
> justice to all the options. As such I suggest you use another forum such as 
> Quora or LinkedIn to get advice.
> (There is also a helpful article on Gigaom that just came out 
> yesterday and all sorts of friendly sales people at companies like 
> MapR and IBM who love giving this kind of advice.)
>
> What we can do here is tell you how Drill can solve or not solve your 
> different use cases and help you work through those.  If you want to go
> into more detail on those, we'd be happy to help.
>
> Thanks again for the interest. Sorry if this seems abrupt but these 
> threads generally aren't productive and tend to be very divisive.
>
> Welcome to the community :)
>
> Jacques
> On Jan 30, 2015 3:28 AM, "Samuel Marks" <[email protected]> wrote:
>
> > Since Hadoop <https://hive.apache.org> came out, there have been
> > various commercial and/or open-source attempts to expose some
> > compatibility with SQL <http://drill.apache.org>. Obviously by
> > posting here I am not expecting an unbiased answer.
> >
> > Seeking an SQL-on-Hadoop offering which provides low-latency
> > querying and supports the most common CRUD <https://spark.apache.org>,
> > including [the basics!] along these lines: CREATE TABLE, INSERT INTO,
> > SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
> > Transactional support would be nice also, but is not a must-have.
> >
> > Essentially I want a full replacement for the more traditional 
> > RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
> >
> > Python is my language of choice for interfacing, however there does
> > seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
> >
> > Here is what I've found thus far:
> >
> >    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
> >    SQL thanks to the Stinger initiative)
> >    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> >    - Apache Spark <https://spark.apache.org> (Spark SQL
> >    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
> >    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
> >    or Parquet <http://parquet.io/>)
> >    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
> >    <http://hbase.apache.org>, lacks full transaction
> >    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
> >    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
> >    built-in functions)
> >    - Cloudera Impala
> >    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
> >    (significant HiveQL support, some SQL language support, no support for
> >    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
> >    amongst others)
> >    - Presto <https://github.com/facebook/presto> from Facebook (can query
> >    Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
> >    Doesn't seem to be designed for low-latency responses across small
> >    clusters, or to support UPDATE operations. It is optimized for data
> >    warehousing or analytics¹
> >    <http://prestodb.io/docs/current/overview/use-cases.html>)
> >    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
> >    community edition <https://www.mapr.com/products/hadoop-download>
> >    (seems to be a packaging of Hive, HP Vertica
> >    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
> >    Drill and a native ODBC wrapper
> >    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> >    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
> >    interface and multi-dimensional analysis [OLAP
> >    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
> >    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
> >    Hive and HBase, and seems targeted at very large data-sets, though it
> >    maintains low query latency)
> >    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
> >    with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
> >    against Hive and Impala
> >    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
> >    ])
> >    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
> >    Lingual <http://docs.cascading.org/lingual/1.0/>²
> >    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
> >    JDBC Drivers, a SQL command shell, and a catalog manager for publishing
> >    files [or any resource] as schemas and tables.")
> >
> > Which—from this list or elsewhere—would you recommend, and why?
> > Thanks for all suggestions,
> >
> > Samuel Marks
> > http://linkedin.com/in/samuelmarks
> >
>
