The problem being referred to was one where the type of the data changed and the order in which the types were encountered made a difference. For files where the schema is known up front (the only case ordinary SQL engines handle), this won't happen. The problem also only occurred with nested data whose schema is discovered on the fly, which ordinary SQL engines don't support either.
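For illustration only (a made-up file and field, nothing from your data): suppose a JSON file contains {"value": 1} on one line and {"value": "n/a"} on the next. A query such as

  -- hypothetical path and field, for illustration only
  SELECT t.`value` FROM dfs.`/tmp/example.json` t;

has to discover the type of value while reading, so whether the integer or the string shows up first changes the schema Drill infers mid-stream. That ordering sensitivity is what triggered the problem.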
That said, it should be handled well.

On Sat, Jul 11, 2015 at 6:52 PM, Stefán Baxter <ste...@activitystream.com> wrote:

> Hi Jacques,
>
> and thank you for answering swiftly and clearly :).
>
> Some additional questions did arise (see inline):
>
> > > - *Foreign key lookups (joins)*
> > >
> I'm guessing my fk_lookup scenario would/could benefit from using other
> storage options for that.
> Currently most of this is in Postgres and I think I saw some mention of
> supporting traditional data sources soon :)
>
> > > - Partially helped by pruning and pre-selection (automatic for Parquet
> > > files since latest 1.1 release)
> > >
> > We do some of this. More could be done. There are a number of open JIRAs
> > on this topic.
> >
> Yeah, I saw the one involving metadata caching. That seems quite
> important.
>
> > > - *Count(*) can be expensive*
> > > *Rows being loaded before filtering* - In some cases whole rows are
> > > loaded before filtering is done (User defined functions indicate this)
> > > - This seems to sacrifice many of the "columnar" traits of Parquet
> > >
> > | Yes, this is a cost that can be optimized. (We have to leave some room to
> > | optimize Drill after 1.0, right :D ) That being said, we've built a custom
> > | Parquet reader that transforms directly from the columnar disk
> > | representation into our in-memory columnar representation. This is several
> > | times faster than the traditional Parquet reader. In most cases, this
>
> Is this custom Parquet reader enabled/available?
> Would it work with remote storage, like S3? (I'm guessing not)
>
> > isn't a big issue for workloads. If you generate your Parquet files using
> > Drill, Drill should be quick to return count(*). However, we've seen some
> > systems generate Parquet files without setting the metadata of the number
> > of records for each file. This would degrade performance as it would
> > require a full scan. If you provide the output of the parquet tools head,
> > we should be able to diagnose why this is a problem for your files.
> >
> Thank you, I will take you up on that if the problem prevails after I have
> stopped making so many novice mistakes :)
>
> > > - What are best practices dealing with streaming data?
> > >
> > Can you expound on your use case? It really depends on what you mean by
> > streaming data.
> >
> We are using Druid and ingest data into it from Kafka/RabbitMQ. It handles
> segment creation (parquet equivalent) and mixes together new/fresh data,
> not stored that way, and historical data that is stored in segments at
> regular intervals.
> I do realize that Drill is not the workflow/ingestion tool, but I wonder if
> there are any guidelines for mixing json/other files with parquet, and
> especially for the transition period from file->parquet, to avoid duplicate
> results or missing portions.
> This may all become clear as I examine other tools that are suited for the
> ingestion, but it seems like Drill should have something since it has
> directory based queries and seems to cater to these kinds of things.
>
> > > - *Views*
> > > - Are parquet based views materialized and automatically updated?
> > >
> > Views are logical only and are executed each time a query above is run.
> > The good news is that the view and the query utilizing it are optimized as
> > a single relational plan so that we do only the work that is necessary for
> > the actual output of the final query.
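To make the view behavior concrete, here is a rough sketch (workspace, table and column names are made up):

  -- hypothetical workspace, paths and columns
  CREATE VIEW dfs.tmp.daily_events_v AS
    SELECT event_date, COUNT(*) AS cnt
    FROM dfs.`/data/events`
    GROUP BY event_date;

  SELECT cnt FROM dfs.tmp.daily_events_v WHERE event_date = '2015-07-01';

Nothing is materialized when the view is created; the second query and the view's SELECT are planned together as one relational plan, so only the work needed for that final result is performed.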
>
> CTAS could also be used, I guess, to create daily or monthly aggregations for
> historical (non-changing) data.
> Can it be used to add to a table, or does it require creating the whole table
> every time?
> I'm guessing that I'm asking the wrong question and that with a directory
> based approach I would just add the new roll-up/aggregation table/file to
> its proper place. (If manual)
>
> Will the PARTITION BY clause prevent the table creation from deleting other
> files-in-the-table if there is no overlap in the PARTITION BY fields?
>
> > > - *Histograms / Hyperloglog*
> > > - Some analytics stores, like Druid, support histograms and HyperLogLog
> > > for fast counting and cardinality estimations
> > > - Why is this missing in Drill, is it planned?
> > >
> > Just haven't gotten to it yet. We will.
> >
> I saw something on this in the Parquet community and think this must be an
> "in tandem" kind of thing.
>
> > Not right now. Parquet does support fixed width binary fields so you could
> > store a 16 byte field that held the UUID. That would be extremely
> > efficient. Drill doesn't yet support generating a fixed width field for
> > Parquet but it is something that will be added in the future. Drill should
> > read the field no problem (as opaque VARBINARY).
> >
> Can you please detail the difference and the potential gain once
> fixed-width is supported?
>
> > > - *Nested flatten* - There are currently some limitations to working
> > > with multiple nested structures - issue:
> > > https://issues.apache.org/jira/browse/DRILL-2783
> > >
> > This is an enhancement that no one has gotten to yet. Make sure to vote
> > for it (and get your friends to vote for it) and we'll probably get to it
> > sooner.
> >
> Yeah, this must be a hot topic (I'm rooting for this one!)
>
> > Jacques
> >
> Thank you again for the prompt and clear answers.
>
> I'm quite impressed with both Drill and Parquet and look forward to digging
> deeper :).
>
> Regards,
> -Stefan
>
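One more note on the CTAS question above. A rough sketch of the partitioning syntax (workspace, paths and column names are made up, so adjust to your own layout):

  -- hypothetical workspace, paths and columns
  CREATE TABLE dfs.tmp.events_agg_2015_06
  PARTITION BY (event_date)
  AS SELECT event_date, customer_id, COUNT(*) AS cnt
     FROM dfs.`/data/archive/2015-06`
     GROUP BY event_date, customer_id;

Each CTAS run writes a complete new table rather than appending to an existing one, so for rolling periods the directory based approach you guessed at is the usual pattern: write each period into its own subdirectory under a common parent and point queries at the parent.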