Hello,

A lot of versioning problems arise when trying to share data through Kafka between multiple applications with different lifecycles and maintainers, since by default a single message in Kafka is just a blob. One way to solve that is to agree on a single serialization format that is friendly to record-by-record storage (like Avro) and, to avoid serializing the schema itself with every message, to reference an entry in the Avro Schema Registry instead (this flow is described here: https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321). On top of the schema registry, dedicated client libraries validate the message structure before it is injected into Kafka. So while Comcast mentions using an Avro schema to describe its feeds, it does not directly mention using Avro files (to describe the schema).
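To make that flow concrete, here is a minimal sketch assuming Confluent's Java client and Avro serializer (the topic name, record fields, and URLs are made up for illustration). The serializer registers or looks up the schema in the registry once, so each message carries only a small schema id rather than the full schema:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RegistryAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Value serializer that talks to the schema registry and prepends
        // only the registered schema id to each message, not the schema itself.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
          + "{\"name\":\"sensorId\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"double\"}]}");

        GenericRecord reading = new GenericData.Record(schema);
        reading.put("sensorId", "s-42");
        reading.put("value", 10.1);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer validates the record against the schema here, so a
            // malformed message fails in the producer instead of poisoning the topic.
            producer.send(new ProducerRecord<>("sensor-readings", "s-42", reading));
        }
    }
}
```

Consumers do the reverse lookup by id, which is how applications with different lifecycles can agree on the structure without shipping it in every message.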
Coming back to Drill, I think it makes a real effort to provide similar features on top of loosely typed datasets. It could probably do better in some cases (treating unknown types as `Unknown` would be better than `Nullable Int`), but its ability to dynamically merge data with different (yet still compatible) schemas is really nice.

When using untyped file formats (JSON, CSV), Drill does its best, and while it is not perfect, it is already pretty good. When relying on typed formats like Parquet / ORC / Avro, a lot of problems disappear because each file describes its columns (names and types), even for complex structures. But the use of CSV/JSON remains problematic. I like the idea of an optional way to describe the expected types somewhere (either in a central meta-store or in a structured file next to the dataset). That would make CTAS much safer and easier to use (sometimes we have to fall back to Spark to generate the Parquet files because of schema/type problems).

Independently of the meta-store, it is a bit annoying that Drill has to `discover` the columns and types at every scan through trial and error, and cannot benefit from previous queries. Extending the `Analyze Table` command so that metadata could be generated from a JSON/CSV file or folder would improve this situation without introducing a costly and painful ETL process.
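To illustrate the "each file describes its columns" point, here is a minimal Java sketch with the plain Avro API (the field names and output path are made up). Because the schema is declared up front and stored in the file header, an integral-looking value and a decimal value both land in the same double column, and a missing field is an explicit null instead of a guessed type:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteTypedFile {
    public static void main(String[] args) throws IOException {
        // The schema goes into the file header, so any reader (Drill included)
        // sees `a` as a nullable double and `b` as a nullable string,
        // no matter what the individual values look like.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"a\",\"type\":[\"null\",\"double\"],\"default\":null},"
          + "{\"name\":\"b\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        GenericRecord r1 = new GenericData.Record(schema);
        r1.put("a", 10.0);   // integral-looking value, still a double
        r1.put("b", "foo");

        GenericRecord r2 = new GenericData.Record(schema);
        r2.put("a", 10.1);
        r2.put("b", null);   // missing column is an explicit null, not a guessed type

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("/tmp/rows.avro"));
            writer.append(r1);
            writer.append(r2);
        }
    }
}
```

This is roughly what we fall back to Spark for today: producing files whose column types a reader can trust without guessing.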
Regards,
Joel

On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni <j...@apache.org> wrote:

> I feel it's probably premature to call it the "death of schema-on-read" just based on one application case. For one product I have been working on recently, one use case is an IoT-related application where data is sent from a variety of small devices (sensors, cameras, etc.). It would be a hard requirement to pre-define a schema upfront for each device before writing data into the system. Further, the value of the data is likely to decrease significantly over time; data within hours/days is way more important than that of weeks/months ago. It's unimaginable to wait for weeks to run a data cleaning/preparation job before users could query such data. In other words, for applications with requirements of flexibility and time-sensitivity, 'schema-on-read' provides a huge benefit compared with the traditional ETL-then-query approach.
>
> Drill's schema-on-read is actually trying to solve a rather hard problem, in that we deal with not only relational types but also nested types. In that sense, Drill is walking in uncharted territory where not many others are doing similar things. Dealing with undocumented/unstructured data is a big challenge. Although Drill's solution is not perfect, IMHO it's still a big step towards solving such a problem.
>
> With that said, I agreed with the points people raised earlier. In addition to "schema-on-read", Drill has to do a better job of handling the traditional cases where the schema is known beforehand, by introducing a meta-store/catalog, or by allowing users to declare a schema upfront (I probably would not call Drill "schema-forbidden"). The restart strategy also seems interesting for handling failures caused by a missing schema / schema change.
>
> On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Well, the restart strategy still works for your examples. And you only pay once. From then on, you look at the cached type information and use an upper-bound data type as you read the data. Since it works to read the values in the right order, it is obviously possible to push down typing information even into the JSON reader.
> >
> > On Tue, Apr 3, 2018, 21:42 Paul Rogers <par0...@yahoo.com.invalid> wrote:
> >
> > > Subtle point. I can provide a schema with Parquet, as you note. (Actually, for Parquet, Drill is schema-required: I can't not provide a schema, due to the nature of Parquet...)
> > >
> > > But I can't provide a schema for JSON, CSV, etc. The point is, Drill forbids the user from providing a schema; only the file format itself can provide the schema (or not, in the case of JSON). This is the very heart of the problem.
> > >
> > > The root cause of our schema change exception is that vectors are, indeed, strongly typed. But file columns are not. Here is my favorite:
> > >
> > > {x: 10} {x: 10.1}
> > >
> > > Blam! The query fails because the vector is chosen as BigInt, then we discover it really should have been Float8. (If the answer is: go back and rebuild the vector with the new type, consider the case where 100K records separate the two above, so that the first batch is long gone by the time we see the offending record.) If only I could tell Drill to use Float8 (or Decimal) up front...
> > >
> > > Views won't help here because the failure occurs before a view can kick in. However, presumably, I could write a view to handle a different classic case:
> > >
> > > myDir /
> > > |- File 1: {a: 10, b: "foo"}
> > > |- File 2: {a: 20}
> > >
> > > With query: SELECT a, b FROM myDir
> > >
> > > For File 2, Drill will guess that b is a Nullable Int, but it is really VarChar. I think I could write clever SQL that says:
> > >
> > > If b is of type Nullable Int, return NULL cast to nullable VarChar, else return b
> > >
> > > The irony is that I must write procedural code to declare a static attribute of the data. Yet SQL is otherwise declarative: I state what I want, not how to implement it.
> > >
> > > Life would be so much easier if I could just say, "trust me, when you read column b, it is a VarChar."
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > > I don't see why you say that Drill is schema-forbidden.
> > >
> > > The Parquet reader, for instance, makes strong use of the implied schema to facilitate reading of typed data.
> > >
> > > Likewise, the vectorized internal format is strongly typed and, as such, uses schema information.
> > >
> > > Views are another way to communicate schema information.
> > >
> > > It is true that you can't, say, view comments on fields from the command line. But I don't understand saying "schema-forbidden".
> > >
> > > On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers <par0...@yahoo.com.invalid> wrote:
> > >
> > > > Here is another way to think about it. Today, Drill is "schema-forbidden": even if I know the schema, I can't communicate that to Drill; Drill must figure it out on its own, making the same mistakes every time on ambiguous schemas.
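P.S. The {x: 10} / {x: 10.1} case Paul describes above is easy to reproduce outside Drill. Here is a small Jackson-based sketch (not Drill's reader, just an illustration of why committing to a column type from the first record is fragile):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FirstRecordTrap {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String[] records = {"{\"x\": 10}", "{\"x\": 10.1}"};

        for (String json : records) {
            JsonNode x = mapper.readTree(json).get("x");
            // Record by record, the same field looks like a different numeric type;
            // a reader that fixes the column type on the first record picks an
            // integer vector and then has nowhere to put 10.1.
            System.out.printf("%s  integral=%b  floatingPoint=%b%n",
                    x, x.isIntegralNumber(), x.isFloatingPointNumber());
        }
    }
}
```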