Hi Stefán, great to hear your thoughts. I'll try to shed some light where
I can.

> *Misc. observations:*
>
> - *Foreign key lookups (joins)*
>    - Coming from the traditional RDBMS world I have a hard time wrapping
>    my head around how this can be efficient/fast

Drill is primarily focused on analytical queries, as opposed to point
lookups and single-row queries. There are two main types of joins we
typically see: fact-dimension and fact-fact.

Fact-dimension joins (most common): in most big data scenarios, the fact
table is many magnitudes larger than the dimension table. In most of these
cases, constructing a hash table to hold the smaller dimension table is
nominal overhead compared to the cost of probing it. As such, traditional
indices are less important.

Fact-fact joins: here things are likely to be slower (and you may need to
spill to disk or use a merge join). The good news is that Drill can use
the aggregate memory of the entire cluster to perform the join. We're also
working on a number of features that will vastly improve performance here,
including partition-aware bucket joins. Even now, Drill fully parallelizes
joins across nodes as well as within each node (many threads). As such,
most people can scale out to achieve the level of performance they desire
(without additional features).

> - Broadcasting is something that I need to understand a lot better
>    before committing :)

Broadcast joins are extremely useful when you're doing a fact-dimension
join where the dimension table is reasonably small. By default, Drill caps
this at 10,000,000 records for the dimension table. The alternative to a
broadcast join is redistributing both sides of the join to colocate sets
of join keys, which requires a full redistribution of data from both
sides. If the fact table is many times larger than the dimension table,
this is frequently far more expensive than a broadcast join. The downside
of the broadcast join is that each node must hold the whole dimension
table in memory. The good news is that Drill will share this single
dataset (and memory) across all threads working on the join on each node.
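To make that concrete: the record-count cap mentioned above is exposed as
a session option, and a typical fact-dimension join looks like the sketch
below (the table paths and column names here are hypothetical):

    -- Adjust the broadcast cap (the default is 10,000,000 records).
    ALTER SESSION SET `planner.broadcast_threshold` = 1000000;

    -- dim_customers is small, so the planner can broadcast it to every
    -- node instead of redistributing the much larger fact_orders table.
    SELECT o.order_id, c.customer_name
    FROM dfs.`/data/fact_orders` o
    JOIN dfs.`/data/dim_customers` c
      ON o.customer_id = c.customer_id;

Prefixing the query with EXPLAIN PLAN FOR lets you check whether the
planner actually chose a broadcast exchange or a redistribution of both
sides.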
> - Will looking up a single value scan all files if not pruned?

It depends on the underlying datastore. If you're working with a system
that has primary or secondary indexing, Drill is designed to take
advantage of those and will generally do a single-record retrieval.
However, data sources like CSV files and Parquet files don't have any
indices, so you need to do a full scan to find a particular value. We're
working with the Parquet community to add index capabilities to Parquet so
that we can also do single-record and range retrievals efficiently.

> - *Rows being loaded before filtering*
>    - In some cases whole rows are loaded before filtering is done
>    (user-defined functions indicate this)
>    - This seems to sacrifice many of the "column" benefits of Parquet

Yes, this is a cost that can be optimized. (We have to leave some room to
optimize Drill after 1.0, right :D ) That being said, we've built a custom
Parquet reader that transforms directly from the columnar disk
representation into our in-memory columnar representation. This is several
times faster than the traditional Parquet reader. In most cases, this
isn't a big issue for workloads.

>    - Partially helped by pruning and pre-selection (automatic for
>    Parquet files since the latest 1.1 release)

We do some of this. More could be done. There are a number of open JIRAs
on this topic.
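As an aside, the directory-pruning case is easy to sketch. Assuming a
hypothetical layout where files live under /data/logs/<year>/<month>/,
Drill exposes each directory level as an implicit dirN column, and a
filter on those columns lets the planner skip entire directories rather
than scanning every file (user_id and event_type are made-up columns):

    -- Only files under /data/logs/2015/07 are scanned; the other
    -- year/month directories are pruned away at planning time.
    SELECT t.user_id, t.event_type
    FROM dfs.`/data/logs` t
    WHERE t.dir0 = '2015' AND t.dir1 = '07';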
> - *Count(*) can be expensive*
>    - Documentation states: "slow on some formats that do not support row
>    count..." - I'm using Parquet and it seems to apply there
>    - Hint: It seems like using "count(columns[0])" instead of "count(*)"
>    may help
>    - Slow count()ing seems like such a bad trade for an analytic
>    solution.

If you generate your Parquet files using Drill, Drill should be quick to
return count(*). However, we've seen some systems generate Parquet files
without setting the record-count metadata for each file. That would
degrade performance since it would require a full scan. If you provide the
output of parquet-tools head, we should be able to diagnose why this is a
problem for your files.

> - *Updating parquet files*
>    - Seems like adding individual rows is inefficient

Yes, Parquet was not natively built for this type of workload.

>    - Update / Insert / Delete seems to be scheduled for Drill 1.2

There is no firm commitment about when these would be included. Adding
them for row-level storage plugins such as HBase will probably happen
before adding them for something like Parquet.

>    - What are best practices for dealing with streaming data?

Can you expound on your use case? It really depends on what you mean by
streaming data.

> - *Unique constraints*
>    - Ensuring uniqueness seems to be defined outside-the-scope-of-drill
>    (Parquet)

As of right now, Drill is a query layer as opposed to a database. Drill
expects the underlying storage plugin/storage system to be responsible for
maintaining things such as uniqueness constraints. This is especially true
today since Drill doesn't yet support insert or update.

> - *Views*
>    - Are parquet-based views materialized and automatically updated?

Views are logical only and are executed each time a query on top of them
is run. The good news is that the view and the query utilizing it are
optimized as a single relational plan, so we do only the work that is
necessary for the actual output of the final query.

> - *ISO date support*
>    - Seems strange that ISO dates are not directly supported (T between
>    date and time and a trailing timezone indicator)
>    - Came across this and somewhat agreed: "For example, the new
>    extended JSON support ($date) will parse a date such as
>    '2015-01-01T00:22:00Z' and convert it to the local time."

The SQL standard and the ISO standard are slightly different. Drill
respects the SQL standard when using SQL casts and literals. On the flip
side, the extended JSON standard specifies using ISO dates as opposed to
SQL dates. If you want to consume ISO dates in Drill, a simple UDF should
suffice. (If you create one, please post it as a patch and we'll include
it in a future release of Drill.)
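In the meantime, you can often handle ISO timestamps in plain SQL. This is
just a sketch: Drill's TO_TIMESTAMP accepts a Joda-style format pattern,
and you may need to adjust the pattern (especially the timezone portion)
for your exact inputs:

    -- The doubled quotes escape the literal 'T' inside the SQL string;
    -- ZZ parses the offset, treating a trailing 'Z' as UTC.
    SELECT TO_TIMESTAMP('2015-01-01T00:22:00Z',
                        'yyyy-MM-dd''T''HH:mm:ssZZ') AS ts
    FROM (VALUES(1));

Note that the result may still be subject to the local-time conversion
behavior quoted above.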
> - *Histograms / HyperLogLog*
>    - Some analytics stores, like Druid, support histograms and
>    HyperLogLog for fast counting and cardinality estimations
>    - Why is this missing in Drill, is it planned?

Just haven't gotten to it yet. We will.

>    - Can it be achieved on top of Parquet?

This is planned for Parquet but is not yet implemented. The Drill
community is working with the Parquet community to try to achieve this in
the near to medium term.

> - *Some stories of SQL idiosyncrasies*
>    - Found this in the mailing archives and it made me smile: "Finally
>    it worked. And the only thing I had to do was writing t2 join t1
>    instead of t1 join t2. I've changed nothing else. And this really
>    seems weird."
>    - SQL support will surely mature over time (like not being able to
>    include aliases in the group by clause)

Drill already has the strongest SQL support of all the open source,
real-time SQL engines today. Of course, as a 1.x product, we still have
room to improve. I don't remember the join-order item mentioned above; if
it is not fixed, we should get it fixed. Let me know if there is a JIRA
open, or open one if you're still seeing this behavior.

> - *Using S3... really?*
>    - Is it efficient or according to best practices to use S3 as a
>    "data source"?
>    - How efficient is column scanning over S3? (Parquet)

A lot of people have found this functionality useful in production.
Remember, most analytical operations are more about large reads than
about IOPS. With enough parallelization, you can get very strong
performance using S3.

> - *Roadmap*
>    - I only found the Drill roadmap in one presentation on Slideshare
>    (failed to save the link, sorry)
>    - Issue tracker in JIRA provides roadmap indications :)
>    - Is the roadmap available?

This is a hard one. As an open source, community-driven project, there is
no formal roadmap. The features that get done are really focused on what
is important to each of the contributors and/or their sponsoring
organizations. In general, much of the 1.x series will be focused on
performance, stability and reliability. More SQL support (like more types
of window functions) will also be done. Beyond that, we'll have to see
what everybody thinks is important. My personal agenda includes the items
outlined on the dev list where we discussed the plan for the 1.2 release.
Others can probably provide their own personal agendas.

> - *Mailgroups (and the standard Apache interface)*
>    - Any plans to use Google Groups or something a tiny bit more
>    friendly?

Afraid not. As an Apache project, we need to stick with the Apache mailing
list/infrastructure.

> - *Data types*
>    - Is there an effective way to store UUIDs in Parquet? (Parquet
>    question really, and the answer seems to be no... not directly)

Not right now. Parquet does support fixed-width binary fields, so you
could store a 16-byte field that held the UUID. That would be extremely
efficient. Drill doesn't yet support generating a fixed-width field for
Parquet, but it is something that will be added in the future. Drill
should read the field no problem (as opaque VARBINARY).

> - *Nested flatten*
>    - There are currently some limitations to working with multiple
>    nested structures - issue:
>    https://issues.apache.org/jira/browse/DRILL-2783

This is an enhancement that no one has gotten to yet. Make sure to vote
for it (and get your friends to vote for it) and we'll probably get to it
sooner.

> I look forward to working with Drill and hope it will be a suitable
> match for our project.

Thanks for all your questions. I'm sure that others will find them very
helpful. Let me know if my answers were helpful or if they simply created
new questions :)

Jacques