Hi Stefán, great to hear your thoughts. I'll try to shed some light where
I can.

> *Misc. observations:*
>
> - *Foreign key lookups (joins)*
>    - Coming from the traditional RDBMS world I have a hard time wrapping
>    my head around how this can be efficient/fast

Drill is primarily focused on analytical queries, as opposed to point
lookups and single-row queries. There are two main types of joins we
typically see: fact-dimension and fact-fact.

Fact-dimension joins (most common): in most big data scenarios, the fact
table is many magnitudes larger than the dimension table. In most of these
cases, constructing a hash table to hold the smaller dimension table is
nominal overhead compared to the cost of probing it. As such, traditional
indices are less important.

Fact-fact joins: here things are likely to be slower (and you may need to
spill to disk or use a merge join). The good news is that Drill can use
the aggregate memory of the entire cluster to perform the join. We're also
working on a number of features that will vastly improve performance here,
including partition-aware bucket joins. Even now, Drill fully parallelizes
joins across nodes as well as within each node (many threads). As such,
most people can scale out to achieve the level of performance they desire
(without additional features).

> - Broadcasting is something that I need to understand a lot better
>    before committing :)

Broadcast joins are extremely useful when you're doing a fact-dimension
join where the dimension table is reasonably small. By default, Drill caps
this at 10,000,000 records for the dimension table. The alternative to a
broadcast join is redistributing both sides of the join to colocate sets
of join keys, which requires a full redistribution of data from both
sides. If the fact table is many times larger than the dimension table,
this is frequently far more expensive than a broadcast join. The downside
of the broadcast join is that each node must hold the whole dimension
table in memory. The good news is that Drill will share this single
dataset (and memory) across all threads working on the join on each node.
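To make that concrete: the record-count cap mentioned above is exposed as
a session option, and a typical fact-dimension join looks like the sketch
below (the table paths and column names here are hypothetical):

    -- Adjust the broadcast cap (the default is 10,000,000 records).
    ALTER SESSION SET `planner.broadcast_threshold` = 1000000;

    -- dim_customers is small, so the planner can broadcast it to every
    -- node instead of redistributing the much larger fact_orders table.
    SELECT o.order_id, c.customer_name
    FROM dfs.`/data/fact_orders` o
    JOIN dfs.`/data/dim_customers` c
      ON o.customer_id = c.customer_id;

Prefixing the query with EXPLAIN PLAN FOR lets you check whether the
planner actually chose a broadcast exchange or a redistribution of both
sides.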
> - Will looking up a single value scan all files if not pruned?

It depends on the underlying datastore. If you're working with a system
that has primary or secondary indexing, Drill is designed to take
advantage of those and will generally do a single-record retrieval.
However, data sources like CSV files and Parquet files don't have any
indices, so you need to do a full scan to find a particular value. We're
working with the Parquet community to add index capabilities to Parquet so
that we can also do single-record and range retrievals efficiently.

> - *Rows being loaded before filtering*
>    - In some cases whole rows are loaded before filtering is done
>    (user-defined functions indicate this)
>    - This seems to sacrifice many of the "column" benefits of Parquet

Yes, this is a cost that can be optimized. (We have to leave some room to
optimize Drill after 1.0, right :D ) That being said, we've built a custom
Parquet reader that transforms directly from the columnar disk
representation into our in-memory columnar representation. This is several
times faster than the traditional Parquet reader. In most cases, this
isn't a big issue for workloads.

>    - Partially helped by pruning and pre-selection (automatic for
>    Parquet files since the latest 1.1 release)

We do some of this. More could be done. There are a number of open JIRAs
on this topic.
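As an aside, the directory-pruning case is easy to sketch. Assuming a
hypothetical layout where files live under /data/logs/<year>/<month>/,
Drill exposes each directory level as an implicit dirN column, and a
filter on those columns lets the planner skip entire directories rather
than scanning every file (user_id and event_type are made-up columns):

    -- Only files under /data/logs/2015/07 are scanned; the other
    -- year/month directories are pruned away at planning time.
    SELECT t.user_id, t.event_type
    FROM dfs.`/data/logs` t
    WHERE t.dir0 = '2015' AND t.dir1 = '07';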
> - *Count(*) can be expensive*
>    - Documentation states: "slow on some formats that do not support row
>    count..." - I'm using Parquet and it seems to apply there
>    - Hint: It seems like using "count(columns[0])" instead of "count(*)"
>    may help
>    - Slow count()ing seems like such a bad trade for an analytic
>    solution.

If you generate your Parquet files using Drill, Drill should be quick to
return count(*). However, we've seen some systems generate Parquet files
without setting the record-count metadata for each file. That would
degrade performance since it would require a full scan. If you provide the
output of parquet-tools head, we should be able to diagnose why this is a
problem for your files.

> - *Updating parquet files*
>    - Seems like adding individual rows is inefficient

Yes, Parquet was not natively built for this type of workload.

>    - Update / Insert / Delete seems to be scheduled for Drill 1.2

There is no firm commitment about when these would be included. Adding
them for row-level storage plugins such as HBase will probably happen
before adding them for something like Parquet.

>    - What are best practices for dealing with streaming data?

Can you expound on your use case? It really depends on what you mean by
streaming data.

> - *Unique constraints*
>    - Ensuring uniqueness seems to be defined outside-the-scope-of-drill
>    (Parquet)

As of right now, Drill is a query layer as opposed to a database. Drill
expects the underlying storage plugin/storage system to be responsible for
maintaining things such as uniqueness constraints. This is especially true
today since Drill doesn't yet support insert or update.

> - *Views*
>    - Are parquet-based views materialized and automatically updated?

Views are logical only and are executed each time a query on top of them
is run. The good news is that the view and the query utilizing it are
optimized as a single relational plan, so we do only the work that is
necessary for the actual output of the final query.

> - *ISO date support*
>    - Seems strange that ISO dates are not directly supported (T between
>    date and time and a trailing timezone indicator)
>    - Came across this and somewhat agreed: "For example, the new
>    extended JSON support ($date) will parse a date such as
>    '2015-01-01T00:22:00Z' and convert it to the local time."

The SQL standard and the ISO standard are slightly different. Drill
respects the SQL standard when using SQL casts and literals. On the flip
side, the extended JSON standard specifies using ISO dates as opposed to
SQL dates. If you want to consume ISO dates in Drill, a simple UDF should
suffice. (If you create one, please post it as a patch and we'll include
it in a future release of Drill.)
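In the meantime, you can often handle ISO timestamps in plain SQL. This is
just a sketch: Drill's TO_TIMESTAMP accepts a Joda-style format pattern,
and you may need to adjust the pattern (especially the timezone portion)
for your exact inputs:

    -- The doubled quotes escape the literal 'T' inside the SQL string;
    -- ZZ parses the offset, treating a trailing 'Z' as UTC.
    SELECT TO_TIMESTAMP('2015-01-01T00:22:00Z',
                        'yyyy-MM-dd''T''HH:mm:ssZZ') AS ts
    FROM (VALUES(1));

Note that the result may still be subject to the local-time conversion
behavior quoted above.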
> - *Histograms / HyperLogLog*
>    - Some analytics stores, like Druid, support histograms and
>    HyperLogLog for fast counting and cardinality estimations
>    - Why is this missing in Drill, is it planned?

Just haven't gotten to it yet. We will.

>    - Can it be achieved on top of Parquet?

This is planned for Parquet but is not yet implemented. The Drill
community is working with the Parquet community to try to achieve this in
the near to medium term.

> - *Some stories of SQL idiosyncrasies*
>    - Found this in the mailing archives and it made me smile: "Finally
>    it worked. And the only thing I had to do was writing t2 join t1
>    instead of t1 join t2. I've changed nothing else. And this really
>    seems weird."
>    - SQL support will surely mature over time (like not being able to
>    include aliases in the group by clause)

Drill already has the strongest SQL support of all the open source,
real-time SQL engines today. Of course, as a 1.x product, we still have
room to improve. I don't remember the join-order item mentioned above; if
it is not fixed, we should get it fixed. Let me know if there is a JIRA
open, or open one if you're still seeing this behavior.

> - *Using S3... really?*
>    - Is it efficient or according to best practices to use S3 as a
>    "data source"?
>    - How efficient is column scanning over S3? (Parquet)

A lot of people have found this functionality useful in production.
Remember, most analytical operations are more about large reads than
about IOPS. With enough parallelization, you can get very strong
performance using S3.

> - *Roadmap*
>    - I only found the Drill roadmap in one presentation on Slideshare
>    (failed to save the link, sorry)
>    - Issue tracker in JIRA provides roadmap indications :)
>    - Is the roadmap available?

This is a hard one. As an open source, community-driven project, there is
no formal roadmap. The features that get done are really focused on what
is important to each of the contributors and/or their sponsoring
organizations. In general, much of the 1.x series will be focused on
performance, stability and reliability. More SQL support (like more types
of window functions) will also be done. Beyond that, we'll have to see
what everybody thinks is important. My personal agenda includes the items
outlined on the dev list where we discussed the plan for the 1.2 release.
Others can probably provide their own personal agendas.

> - *Mailgroups (and the standard Apache interface)*
>    - Any plans to use Google Groups or something a tiny bit more
>    friendly?

Afraid not. As an Apache project, we need to stick with the Apache mailing
list/infrastructure.

> - *Data types*
>    - Is there an effective way to store UUIDs in Parquet? (Parquet
>    question really, and the answer seems to be no... not directly)

Not right now. Parquet does support fixed-width binary fields, so you
could store a 16-byte field that held the UUID. That would be extremely
efficient. Drill doesn't yet support generating a fixed-width field for
Parquet, but it is something that will be added in the future. Drill
should read the field no problem (as opaque VARBINARY).

> - *Nested flatten*
>    - There are currently some limitations to working with multiple
>    nested structures - issue:
>    https://issues.apache.org/jira/browse/DRILL-2783

This is an enhancement that no one has gotten to yet. Make sure to vote
for it (and get your friends to vote for it) and we'll probably get to it
sooner.

> I look forward to working with Drill and hope it will be a suitable
> match for our project.

Thanks for all your questions. I'm sure that others will find them very
helpful. Let me know if my answers were helpful or if they simply created
new questions :)

Jacques