A incomplete map to a maze - or - the jurney down the rabbit hole

Stefán Baxter Mon, 03 Aug 2015 16:45:32 -0700

Hi,

I have been meaning to write a blog post regarding our Drill experiments
but I though I might share some thoughts here first.
Hopefully some newbie can benefit from this and perhaps it sheds some light
on what drives newcomers in this community (or at least some part of them).


A bit of a back story. I work for a small startup that has been developing
it's application for *two long* but we hope to lunch our product in
Q1/2016.
We have been using Druid as a real-time analytic store and, even though
Druid does extremely well what it does, we are now forced to replace it to
allow for functionality that we can not be without.

We started to evaluate several solutions and we are now left with a handful
that we like. We have not had the time to evaluate all of them to extremes
but we feel comfortable with the general approach we have taken.

There were many things for us to consider and I want to run through them
here to seek advice from those here that know better. (I apologize for the
long read)

*Initial questions we had:*

   - Do we want to operate our own (distributed) file system or rely on
   something like S3?
   - Are there other options?

   - Do we favor "bringing the computation to the data" or "bring the data
   to the computation"?
   - For us Drill+S3 is a "bring data to the computation" solution while
   Hadoop+Hive+MapReduce is a "bringing the computation to the data" solution

   - How can we effectively merge streaming data with historical data and
   process both at the same time?
   - Something that Druid does very well

   - What is the best way to merge data from different sources in a
   polyglot data / format / sources environment?

   - How can we minimize the effect of large number of small tenants?
   - Warming up tenant specific indexes in Elastic Search can, for example,
   be costly (What a great solution though)

   - How can we evict long-tail data from fast storage and gradually move
   it into deep storage over time ?
   - ... while avoiding a lengthy warm-up period
   - ... while using deep storage also as backup even for fresh data

   - Can we cache aggregates or would we just be building a cache that
   next-to-nothing would ever hit?
   - aggregation for days, weeks, months with all the mostly used spins
   - historical data does not change here so one level of complexity is
   absent

   - Can we achieve sub-second results dealing with our tenant specific
   data sets
   - We will never be FB/Twitter but we still got some sizable chunks of
   data :)

   - Is there a stack available that already works like this?
   - MapR / BDAS etc. etc.

   - Can we build this using "pure" Apache components?
   - We would like to have the option of going without support/license-fees
   but we also like the option of professional services with SLAs and paid
   subscriptions

   - Is this all a premature optimization?
   - aka. "let's just use Elastic Search for now"


*Storage format*

We decided early on that the we liked Parquet <https://parquet.apache.org/>
enough to commit to it for our processed/historical data.
We know that ORC <https://orc.apache.org/> is also great but the we believe
that Parquet will, with its  support for nested structures, broad appeal
and interoperability become a de-facto standard before long.
(We really like ORC as well :) )

Parquet has so many things going for it so I jump to what we see as current
cons:

   - Parquet is does not compress as well as ORC
   - Seems to makes up for it in retrieval speed (
   
https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/
   )

   - Parquet is still missing some things that can make it even better for
   analytics
   - Like: Bloom filters <https://issues.apache.org/jira/browse/PARQUET-41>,
   HyperLogLog <https://issues.apache.org/jira/browse/PARQUET-42>
    and indexing/sorting

   - Hive has some problems dealing with data in nested structures in
   Parquet files (but that may have changed)

   - It's ineffective to gradually add data to parquet files (or so we
   believe) so streaming data should be batched into Parquet files

Our streaming data is mixed schema (partially schemed and partially schema
less) so using Avro <https://avro.apache.org/> or Protobuff
<https://developers.google.com/protocol-buffers/> is doable but a tricky
while JSON is very doable but freakishly verbose.
Drill will, for example, allow us to join JSON data with Parquet data and
in our tests we have been merging log information, in JSON format, with
historical/older data stored in Parqet.
Querying log files like this is sub-optimal, even though they are stored on
PCIE-flash, so we still look for viable alternatives.

Worst case scenario for us is to continue to work with this log approach
but then using Avro.

Relevant Drill points:

   - +  Drill knows all the file formats we like to use (and then some)
   - -   Drill still misses RDBMs access (via JDBC) but that will show up
   soon enough (for us)


*Query engine*

We like SQL.

It's the key to so many systems, tools and resources that selecting a
proprietary query language is not an option (any more).
The query engines we know (of) in this space are:

   - Presto <https://prestodb.io/>
   - Apache Drill <https://drill.apache.org/>
   - Apache Spark <http://spark.apache.org/>

They all have in common the ability to run distributed queries and merge
data from different sources using different format, which is great!
Spark does, in addition to this, offer a range of capabilities
(ML/Graph/etc.) that are appealing to us.

We have not decided yet what to do but Drill is very appealing to us and we
have it running in the "production" system used for our dev. partners.

Relevant Drill points:

   - + Drill does SQL well and is getting better at dealing with
   evolving/dirty schema (aka. JSON curve balls)
   - + Drill is getting better and better in window functions and various
   aggregations
   - + Drill can be accessed via JDBC/ODBC and it has a Spark connection as
   well
   - + Drill does User Defined Functions (Could be a bit more accessible
   but hey, they are there!)

*Storage:*

It's very appealing to use Hive <https://hive.apache.org/> and Hadoop/HDFS
<https://hadoop.apache.org/> to store our data and Hive can point to data
in S3 as well as in other storage locations.
Many solutions know how to use the Hive catalog and some optimization seems
to be available the utilizes it.

Still we are not convinced that we want to build a hadoop
<https://hadoop.apache.org/> cluster (yeah, it's mostly prejudice)
We don't want to use S3 directly or be forced to live in the EC2 ecosystem
for it to make any operation sense. (yeah, mostly we are cheep)

Ideally we would like something like this:

   - Use S3 (or any other cost effective alternative) as deep storage

   - Cache new and in-demand-data locally to minimize the S3 traffic as
   much as possible (Large tired cache: t1:Flash, t2:SSD and t3:HDD)

   - We would like to use a file system that plays nice with the Parquet
   file format and dos the scans/reads required to effectively work with the
   columnar nature of it.
   we are still not sure if HDFS is the ideal match for parquet from this
   perspective but we think that both Gluster <http://www.gluster.org/> and
   Ceph <http://ceph.com/> could be (even though Ceph is a Blob/Object
   store like S3 underneath)

Relevant Drill Points (for file system based operations):

   - - All Drillbits (nodes) require access to the same data for
   distribution/planning to work ("bring the data to the computation")
   planner that knows about "single-node-data" and can plan for it would be
   ideal for alternatives like the one we have in mind

   - - Partition pruning does not work with directories.
   Partitioning data based on time span into a sub-directory structure has
   no benefits (Drill goes through them all, always)

We were quite optimistic when we came across Tachyon
<http://tachyon-project.org/> and got it to work with Drill (see earlier
email).
It seemed like the ideal bridge between Drill and S3 that would allow us to
do multi-tiered caching and smart sharing of content between Drillbits
(nodes) without fetching it multiple times from S3.
But Tachyon currently has a single drawback that we can not accept. It
stored all files in S3 in a proprietary way that does not reflect the "file
system" that it fronts.
Also; Accessing Tachyon seems to be a API only exercise and NFS access, for
example, does not exist.  (In that way Tachyon is wicket fast and slow (as
in dim) at the same time)

*To oversimplify - the choice is now between the beaten path
(MapR/BDAS/etc.) or ... *

*A)* the 12" mix - feat. file system only:

   - Query engine: *Drill*
   - Streaming/log data format: *Avro* / *JSON*
   - Historical Data Format: *Parquet*
   - Extra dimensions / enrichment information: *RDBM* / *Cassandra* /
   *Solr* / *More*
   - File System: *S3 + Tachyon* / *Gluster* / *Ceph*

*B)* the 6" mix - feat. Johnny Cash:

   - Query engine: *Drill / Presto / Spark / Spark + Drill*
   - Streaming/log data format: *Avro* / *JSON*
   - Historical Data Format: *Parquet*
   - Extra dimensions / enrichment information: *RDBM* / *Cassandra* /
   *Solr* / *More*
   - File System:
*Hadoop + Hive *


*General Drill observations that are relevant to us (partly redundant):*

   - REST API does only return JSON
   - Uncompressed and incorrectly typed (count(*) returns a number in
   quotes)
   - the result set is always known and returning Avro/Protobuf could be
   beneficial

   - Schema discovery is still a bit immature / fragile / temperamental
   - This is understandable and will change

   - Partition pruning does not work for directories
   - See point above (Storage)

   - Schema information is not available for data in the file system
   - This will likely change (at least the schema is created on query time)

   - Metadata queries are not available (data discovery queries)
   - Discovering fields, cardinality and composition for data that meets
   certain criteria
   - This is essential in building good UI for tenant specific data (data
   that varies between tenants)

I will, in the near future, try to involve this into a more informative
blog post and share it for those starting down this path.

There are is so much more going on in/around this space (Tez
<https://tez.apache.org/>, Flink <https://flink.apache.org/> ...)

All input is welcomed.

Best regards,
 -Stefan Baxter

PS. I'm a bit dyslexic and sometimes the little bastards sneak though. Just
laugh and shake your head when you come across them :)

A incomplete map to a maze - or - the jurney down the rabbit hole

Reply via email to