Thanks for the background, Jinfeng. Your explanation brings us back to the topic Arina raised: the state and direction of the Drill project.
For several years now, Drill has said, essentially, "dabble with raw data all you like early in the project, but for production, ETL your data into Parquet." One view is that we double down on this idea. Parquet has many advantages: it carries its own schema. As Jinfeng noted, it is the product of an ETL process that cleans up and normalizes data, removing the variations that creep in during schema evolution. Further, Drill already has Ted's mechanism to pre-scan the data: Drill does it to capture the Parquet footer and directory metadata. There is a not-yet-committed project to gather stats on Parquet files. Parquet allows Drill to be schema-free (yes, a marketing term; really schema-on-read) with a good, solid schema as defined in Parquet. Of course, even Parquet is subject to ambiguities due to newly added columns, but ETL should clean up such issues.

Extend that to other types: "schema-free" could mean "Drill does not do schemas; each data source must provide its own CONSISTENT schema." Parquet does this today, as does Hive. Allow external hints for CSV and JSON. Require that CSV and JSON have clean, extended relational schemas to avoid messy ambiguities and schema changes.

An argument can be made that many fine ETL tools exist: NiFi, Spark, MR, Hive, StreamSets, ... Drill need not try to solve this problem. Further, those other products generally allow the user to use code to handle tough cases; Drill only has SQL, and often SQL is simply not sufficiently expressive. Here is a favorite example:

    {fields: [
      {id: 1, name: "customer-name", type: "string", value: "fred"},
      {id: 2, name: "balance", type: "float", value: "123.45"}]}

It is unlikely that Drill can add enough UDFs to parse the above, and to do so efficiently. But parsing the above in code (in Spark, say) is easy.

So, perhaps Drill becomes a Parquet-focused query engine, requiring an unambiguous schema, defined by the data source itself (which includes, for Parquet, Drill's metadata and stats files).
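To illustrate the "easy in code, hard in SQL" point, here is a minimal sketch in plain Python (the same logic ports directly to a Spark map function). The `COERCE` table and the strict-JSON quoting are my assumptions for the sake of a runnable example; the original encoding above uses unquoted keys.

```python
import json

# Hypothetical mapping from the "type" tags in the data to Python types.
COERCE = {"string": str, "float": float, "int": int}

def flatten(record_json):
    """Turn the generic {fields: [{name, type, value}, ...]} encoding
    into an ordinary flat record with typed columns."""
    record = json.loads(record_json)
    return {f["name"]: COERCE[f["type"]](f["value"]) for f in record["fields"]}

row = flatten('{"fields": ['
              '{"id": 1, "name": "customer-name", "type": "string", "value": "fred"},'
              '{"id": 2, "name": "balance", "type": "float", "value": "123.45"}]}')
# row is now {'customer-name': 'fred', 'balance': 123.45}
```

A few lines of procedural code unpivot the name/type/value triples into typed columns; expressing the same transform as a SQL query over the raw structure would require a UDF per type plus a pivot, for every such encoding users invent.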
Drill is "schema-free," but the data is required to provide a clean, unambiguous schema. The only problem is that this big data niche is already occupied by an entrenched leader called Impala. Impala is faster than Drill (even if it is hard to build and is very hard to tune). Drill can't win by saying, "we're almost as fast, have fewer bugs, and are easier to build." Spark didn't win by being almost as good as MR; it won because it was far better on many dimensions. Marketing 101 says that a product can never win a head-to-head battle with an entrenched leader.

So, what niche can Drill fill that is not already occupied? Spark is great for complex data transforms of the kind shown above, but is not good at all for interactive SQL. (Spark is not a persistent engine; it is a batch system like MR. As noted in an earlier message, Spark shuffles data in stages, not in batches like Drill and Impala.) But Spark has a huge community; maybe someone will solve these issues. So, Drill as a better Spark is not appealing.

One idea that has come up from time to time is Drill as an open source Splunk-like tool. Splunk can ingest zillions of file formats using adapters, akin to Drill's storage and format plugins. Ingesting arbitrary files requires some schema cleansing on read. Drill's answer could be to add ad-hoc metadata, allow data-cleaning plugins, and add UDFs. That is, double down on the idea that Drill reads multiple formats, and solve the remaining issues to do so well.

In short, the big question is, "what does Drill want to do now that it's grown up?" Compete with Impala (Parquet only)? Compete with Spark (better code-based query engine)? Compete with Splunk (query any file format)? Something else?
Whatever we do, to Weijie's point, we should do it in a way that is stable: today's approach to handling messy schemas can't ever work completely, because it requires that Drill predict the future: a reader must decide on record 1 how to handle a field that won't actually appear until file (or block) 100. Do we need that? How do we maintain code (union vectors, list vectors, schema change) that never worked and probably never can? What is the better solution?

Thanks,
- Paul

On Wednesday, August 15, 2018, 11:15:10 PM PDT, Jinfeng Ni <j...@apache.org> wrote:

The use case Weijie described seems to fall into the category of a traditional data warehouse, i.e., schemas are predefined by users and data strictly conforms to the schema. Certainly this is one important use case, and I agree that the schema-on-read logic in Drill's run-time is indeed a disadvantage for such a use case, compared with other SQL query engines like Impala/Presto. The question we want to ask is whether that's the only use case Drill wants to target. We probably want to hear more cases from the Drill community before we can decide what's the best strategy going forward.

In the examples Paul listed, why would two sets of data have different schemas? In many cases, that's because the application generating the data changed: adding/deleting one field, or modifying an existing field. ETL is a typical approach to clean up such data with differing schemas. Drill's argument, a couple of years ago when the project was started, was that ETL is too time-consuming; it would provide great value if a query engine could query directly against such datasets. I feel Paul's suggestion of letting the user provide a schema, or having Drill scan/probe and learn the schema, falls in the middle of the spectrum; ETL is one extreme, and Drill's current schema-on-read is the other extreme.
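To make the "predict the future" problem concrete, here is a toy sketch (plain Python, hypothetical reader, not Drill's actual code) of a batch reader that fixes its schema from the first record it sees. A field that first appears later forces a mid-stream schema change that every downstream operator must then cope with:

```python
def infer_schema(record):
    """Fix column names and types from a single record."""
    return {name: type(value).__name__ for name, value in record.items()}

def read_batches(records, batch_size=2):
    """Yield (schema, batch) pairs; a new schema is emitted whenever a
    record no longer fits the schema chosen from record 1."""
    schema, batch = None, []
    for rec in records:
        rec_schema = infer_schema(rec)
        if schema is None:
            schema = rec_schema
        elif rec_schema != schema:
            # Downstream operators see this as a "schema change" event.
            if batch:
                yield schema, batch
            schema, batch = rec_schema, []
        batch.append(rec)
        if len(batch) == batch_size:
            yield schema, batch
            batch = []
    if batch:
        yield schema, batch

rows = [{"a": 1}, {"a": 2}, {"a": 3, "b": "x"}]  # "b" appears only at record 3
out = list(read_batches(rows))
# The reader had no way to know at record 1 that "b" was coming,
# so it emits two incompatible schemas.
```

The reader is forced to commit to a schema before it has seen the data that invalidates it; the same trap appears whether the late-arriving field shows up at record 3 or at file 100.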
Personally, I would prefer letting Drill scan/probe the schema, as it might not be easy for users to provide a schema in the case of nested data (would they have to provide type information for every nested field?). To Weijie's comment about the complexity of the code dealing with schema: in theory we should refactor/rewrite the majority of the run-time operators, separating the logic of handling schema from the logic of handling the regular data flow. That would clean up the current mess.

PS 1: IMHO, "schema-less" is purely a PR word. The more appropriate word for Drill would be schema-on-read.
PS 2: I would not call it a battle between non-relational data and a relational engine. The extended relational model has array/composite types, similar to what Drill has.

On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <tongweijie...@gmail.com> wrote:
> @Paul I really appreciate the statement `Effort can go into new features
> rather than fighting an unwinnable battle to use non-relational data in a
> relational engine.`
>
> At AntFinancial (known as Alipay, an Alibaba-related company) we now use
> Drill to support most of our analysis work. Our business and data are
> complex enough. Our strategy is to let users design their schema first,
> then dump in their data, and query their data later. This workflow runs
> fluently. But by looking deep inside Drill's code internals and the JIRA
> bugs, we see that most of the non-intuitive code exists to solve schema
> change but is really no help to most actual use cases. I think this also
> makes the storage plugin interface not so intuitive to implement.
>
> We are sacrificing most of our work to pay for little income. Users really
> don't care about defining a schema first, but pay attention to whether their
> query is fast enough. Probing the data to guess the schema and caching
> it is, to me, a compromise strategy, but still not clean enough.
> So I hope we move the messy schema-solving logic out of Drill, to make the
> code cleaner, by defining the schema first with DDL statements. If we agree
> on this, the work should be a sub-work of DRILL-6552.
>
> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:
>
> > Hi Ted,
> >
> > I like the "schema auto-detect" idea.
> >
> > As we discussed in a prior thread, caching of schema is a nice add-on once
> > we have defined the schema-on-read mechanism. Maybe we first get it to work
> > with a user-provided schema. Then, as an enhancement, we offer to infer the
> > schema by scanning data.
> >
> > There are some ambiguities that schema inference can't resolve: in
> > {x: "1002"} {x: 1003}, should x be an Int or a Varchar?
> >
> > Still, if Drill could provide a guess at the schema, and the user could
> > refine it, we'd have a very elegant solution.
> >
> > Thanks,
> > - Paul
> >
> > On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> > This is a bold statement.
> >
> > And there are variants of it that could give users nearly the same
> > experience that we have now. For instance, we could cache discovered
> > schemas for old files and discover the schema for any new file that we
> > see (and cache it) before actually running a query. That gives us pretty
> > much the flexibility of schema on read without as much of the burden.
> >
> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com> wrote:
> >
> > > Hi all:
> > > Hope the statement does not seem too rash to you.
> > > Drill claims to be a schema-free distributed SQL engine. It takes a lot
> > > of work to make the execution engine support storage formats like JSON
> > > files. It is easy to create bugs and makes the code logic ugly. I
> > > wonder whether we still insist on this, since we are designing the
> > > metadata system with DRILL-6552.
> > > Traditionally, people are used to designing their table schema first
> > > before firing a SQL query. I don't think skipping this saves people too
> > > much time. Other systems like Spark are popular not because they lack
> > > schema declarations. I think we should be brave enough to make the right
> > > decision on whether to still insist on this feature, which seems not so
> > > important but is a burden.
> > > Thanks.