Hi Charles,

The persistent store is just ZooKeeper, and ZK is known to work poorly as a
distributed DB. ZK works great for things like tokens, node registrations
and the like. But ZK scales very poorly for things like schemas (or query
profiles, or a list of active queries).

A more scalable approach may be to cache the schemas in each Drillbit, then
translate them to Drill's format and include them in each Scan operator
definition sent to each execution Drillbit. That solution avoids race
conditions when the schemas change while a query is in flight. This is, in
fact, the model used for storage plugin definitions. (The storage plugin
definitions are themselves stored in ZK, but tend to be small and few in
number.)
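
For illustration, the sub-scan definition could carry the translated
schema so every fragment sees the same snapshot. A rough sketch (the
class and field names are hypothetical, not existing Drill classes, and
it assumes the schema serializes with the scan definition, as Drill's
provided-schema support suggests it can):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.drill.exec.record.metadata.TupleMetadata;

// Hypothetical scan spec: the planner translates the DFDL schema once,
// then ships the Drill-format snapshot to each execution Drillbit, so
// an in-flight query never sees a mid-query schema change.
public class DfdlScanSpec {
  private final String dfdlSchemaUri;      // where the schema came from
  private final TupleMetadata drillSchema; // plan-time snapshot

  @JsonCreator
  public DfdlScanSpec(
      @JsonProperty("dfdlSchemaUri") String dfdlSchemaUri,
      @JsonProperty("drillSchema") TupleMetadata drillSchema) {
    this.dfdlSchemaUri = dfdlSchemaUri;
    this.drillSchema = drillSchema;
  }

  public String getDfdlSchemaUri() { return dfdlSchemaUri; }
  public TupleMetadata getDrillSchema() { return drillSchema; }
}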

- Paul


On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cgi...@gmail.com> wrote:

> Hi Mike,
> I hope all is well.  I remembered one other piece which might be useful
> for you.  Drill has an interface called a PersistentStore which is used for
> storing artifacts such as tokens etc.  I've used it on two occasions: in
> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
> store OAuth user tokens which need to be preserved and shared across
> drillbits, and also frequently updated.  I was thinking that this might be
> useful for caching the DFDL schemata.  If you take a look here:
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth,
> and here
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
> you can see how I used that.
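>
> For reference, the core pattern looks roughly like this (a sketch; the
> store name and value type are made up, and the method names are from
> memory, so check them against the linked code):
>
> import com.fasterxml.jackson.databind.ObjectMapper;
> import org.apache.drill.exec.store.sys.PersistentStore;
> import org.apache.drill.exec.store.sys.PersistentStoreConfig;
>
> // Open (or create) a named store whose values serialize as JSON.
> PersistentStoreConfig<String> config = PersistentStoreConfig
>     .newJacksonBuilder(new ObjectMapper(), String.class)
>     .name("dfdl.schemas")   // hypothetical store name
>     .build();
> PersistentStore<String> store =
>     context.getStoreProvider().getOrCreateStore(config);
>
> // Values are shared across drillbits and can be updated frequently.
> store.put("mySchema", serializedSchema);
> String cached = store.get("mySchema");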
>
> Best,
> -- C
>
> > On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mbecke...@apache.org> wrote:
> >
> > Very helpful.
> >
> > Answers to your questions, and comments are below:
> >
> > On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgi...@gmail.com> wrote:
> >> HI Mike,
> >> I hope all is well.  I'll take a stab at answering your questions.  But
> I have a few questions as well:
> >>
> >> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
> was that this would be a format plugin, but let me know if you were
> thinking differently
> >
> > Format plugin.
> >
> >> 2.  In traditional deployments, where do people store the DFDL schemata
> files?  Are they local or accessible via URL?
> >
> > Schemas are stored in files, or in jar files created when packaging a
> schema project. Hence a URI is the preferred identifier for them.  They are
> not retrieved remotely or anything like that. It's a matter of whether they
> are in jars on the classpath, directories on the classpath, or just a file
> location.
> >
> > The source code of DFDL schemas is often created using other schemas as
> > components, so a single "DFDL schema" may have parts that come from 5 jar
> > files on the classpath, e.g., 2 different header schemas, a library schema,
> > and the "main" schema that assembles them all.  Inside schemas they refer
> > to each other via xs:include or xs:import; the schemaLocation attribute
> > takes a URI giving the location of the included/imported schema, and those
> > URIs are interpreted the same way we would want Drill to identify the
> > location of a schema.
> >
> > However, people will really want to pre-compile any real (non-toy/test)
> > DFDL schema into a binary ".bin" file for faster loading. Otherwise,
> > Daffodil schema compilation time can be excessive (minutes for large DFDL
> > schemas; for example, the DFDL schema for VMF is 180K lines of DFDL).
> > A compiled schema lives in exactly one file, which is relatively small
> > (the compiled form of the VMF schema is 8 MB). So the schema path given in
> > a Drill SQL query, or in the config, should accept either a compiled
> > schema or a source-code schema (.xsd), the latter mostly for test,
> > training, and toy examples that we would compile on the fly.
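> >
> > To illustrate both paths (a sketch using the Daffodil Java API as I
> > recall it; the file names are made up):
> >
> > import java.io.File;
> > import java.net.URI;
> > import org.apache.daffodil.japi.Compiler;
> > import org.apache.daffodil.japi.Daffodil;
> > import org.apache.daffodil.japi.DataProcessor;
> > import org.apache.daffodil.japi.ProcessorFactory;
> >
> > Compiler compiler = Daffodil.compiler();
> >
> > // Source schema (.xsd): compile on the fly -- fine for tests/toys.
> > URI schemaUri = getClass().getResource("/xsd/mySchema.dfdl.xsd").toURI();
> > ProcessorFactory pf = compiler.compileSource(schemaUri);
> > DataProcessor dp = pf.onPath("/");
> >
> > // Pre-compiled schema (.bin): reload, skipping schema compilation.
> > DataProcessor fast = compiler.reload(new File("mySchema.bin"));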
> >
> >> To get the DFDL schema file or URL we have a few options, all of which
> revolve around setting a config variable.  For now, let's just say that the
> schema file is contained in the same folder as the data.  (We can make this
> more sophisticated later...)
> >
> > It would make life difficult if the schemas and test data must be
> co-resident. Most schema projects have these in entirely separate
> sub-trees. Schema will be under src/main/resources/..../xsd, compiled
> schema would be under target/... and test data under
> src/test/resources/.../data
> >
> > For now I think the easiest thing is just we get two URIs. One is for
> the data, one is for the schema. We access them via
> getClass().getResource().
> >
> > We should not worry about caching or anything for now. Once the above
> works for a decent scope of tests we can worry about making it more
> convenient to have a library of schemas at one's disposal.
> >
> >>
> >> Here's what you have to do.
> >>
> >> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
>  Note... config variables must be private and final.  If they aren't it can
> cause weird errors that are really difficult to debug.  For some reference,
> take a look at the Excel plugin.  (
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
> )
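> >>
> >> In sketch form (the class and type name are illustrative, following
> >> the Excel plugin's pattern, not an existing class):
> >>
> >> import com.fasterxml.jackson.annotation.JsonCreator;
> >> import com.fasterxml.jackson.annotation.JsonProperty;
> >> import com.fasterxml.jackson.annotation.JsonTypeName;
> >> import org.apache.drill.common.logical.FormatPluginConfig;
> >>
> >> @JsonTypeName("dfdl")  // hypothetical plugin type name
> >> public class DfdlFormatConfig implements FormatPluginConfig {
> >>   // Config variables must be private and final.
> >>   private final String dfdlSchema;
> >>
> >>   @JsonCreator
> >>   public DfdlFormatConfig(
> >>       @JsonProperty("dfdlSchema") String dfdlSchema) {
> >>     this.dfdlSchema = dfdlSchema;
> >>   }
> >>
> >>   public String getDfdlSchema() { return dfdlSchema; }
> >>   // equals() and hashCode() omitted, but required in a real config.
> >> }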
> >>
> >> Setting a config variable there will allow a user to set a global
> >> schema definition.  This can also be configured individually for various
> >> workspaces.  So let's say you had PCAP files in one workspace: you could
> >> globally set the DFDL file for that workspace, and for another workspace
> >> containing some other file type, you could create another DFDL plugin
> >> instance.
> >
> > Ok, so the above lets me play with Drill and one schema by default. Ok
> for using Drill to explore data, and useful for testing.
> >
> >>
> >> Now, this is all fine and good, but a user might also want to define
> the schema file at query time.  The good news is that Drill allows you to
> do that via the table() function.
> >>
> >
> > This would allow real data-integration queries against multiple
> different DFDL-described data sources. Needed for a compelling demo.
> >
> >> So let's say that we want to use a different schema file than the
> default, we could do something like this:
> >>
> >> SELECT ....
> >> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
> >> dfdlSchema=>'path_to_schema'))
> >>
> >> Take a look at the Excel docs (
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md)
> which demonstrate how to write queries like that.  I believe that the
> parameters in the table function take higher precedence than the parameters
> from the config.  That would make sense at least.
> >>
> >
> > Perfect. I'll start with this.
> >
> >>
> >> 2.  Now that we have the schema file, the next thing would be to
> convert that into a Drill schema.  Let's say that we have a function called
> dfdlToDrill that handles the conversion.
> >>
> >> What you'd have to do is in the constructor for the BatchReader, you'd
> have to set the schema there.  So pseudo code:
> >>
> >> public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan,
> >>     FileSchemaNegotiator negotiator) {
> >>   // Other stuff...
> >>
> >>   // Get the Drill schema from the DFDL schema (dfdlSchemaFile is
> >>   // whatever the config or table function supplied).
> >>   TupleMetadata schema = dfdlToDrill(dfdlSchemaFile);
> >>
> >>   // Here's the important part
> >>   negotiator.tableSchema(schema, true);
> >> }
> >>
> >> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
> boolean as to whether the schema is final or not.  Once this schema has
> been added to the negotiator object, you can then create the writers.
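> >>
> >> That is, roughly (a sketch; EVF method names as I recall them):
> >>
> >> // After registering the schema, build the loader and get the writer.
> >> ResultSetLoader loader = negotiator.build();
> >> RowSetLoader rowWriter = loader.writer();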
> >>
> >
> > That negotiator.tableSchema() is ideal. I was hoping that this was going
> > to be the only place the metadata had to be given to Drill. Excellent.
> >
> >>
> >> Take a look here:
> >>
> >> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> >>
> >> I see Paul just responded so I'll leave you with this.  If you have
> additional questions, send them our way.  Do take a look at the Excel
> plugin as I think it will be helpful.
> >>
> > Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
> work similarly.
> >
> > This will take me a few more days to get to a pull request. The first
> one will be initial review, i.e., not intended to merge without more tests.
> Probably it will support only integer data fields, but should support lots
> of data shapes including vectors, choices, sequences, nested records, etc.
> >
> > Thanks for the help.
> >
> >>
> >>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
> >>>
> >>> So when a data format is described by a DFDL schema, I can generate an
> >>> equivalent Drill schema (TupleMetadata). This schema is always complete.
> >>> I have unit tests working with this.
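> >>>
> >>> For example, the translation produces a TupleMetadata like this (a
> >>> sketch using Drill's SchemaBuilder; the column names are made up):
> >>>
> >>> import org.apache.drill.common.types.TypeProtos.MinorType;
> >>> import org.apache.drill.exec.record.metadata.SchemaBuilder;
> >>> import org.apache.drill.exec.record.metadata.TupleMetadata;
> >>>
> >>> // A nested DFDL record maps to a Drill map (tuple) column.
> >>> TupleMetadata schema = new SchemaBuilder()
> >>>     .add("messageId", MinorType.INT)
> >>>     .addMap("header")
> >>>       .add("version", MinorType.INT)
> >>>     .resumeSchema()
> >>>     .buildSchema();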
> >>>
> >>> To do this for a real SQL query, I need the DFDL schema to be
> >>> identified on the SQL query by a file path or URI.
> >>>
> >>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
> >>>
> >>> Next, assuming I have the DFDL schema identified, I generate an
> >>> equivalent Drill TupleMetadata from it (or, hopefully, retrieve it from
> >>> a cache).
> >>>
> >>> What objects do I call, or what classes do I have to create to make
> >>> this Drill TupleMetadata available to Drill so it uses it in all the
> >>> ways a static Drill schema can be useful?
> >>>
> >>> I just need pointers to the code that illustrate how to do this. Thanks
> >>>
> >>> -Mike Beckerle
> >>>
> >>>
> >>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> This is a complex question and has two answers.
> >>>>
> >>>> First, the standard enhanced vector framework (EVF) used by most
> >>>> readers assumes a "pull" model: read each record. This is where the
> >>>> next() comes in: readers just implement this to read the next record.
> >>>> But the code under EVF works with a push model: the readers write to
> >>>> vectors, and signal the next record. EVF translates the lower-level
> >>>> push model to the higher-level, easier-to-use pull model. The best
> >>>> example of this is the JSON reader, which uses Jackson to parse JSON
> >>>> and responds to the corresponding events.
> >>>>
> >>>> You can thus take over the task of filling a batch of records. I'd
> >>>> have to poke around the code to refresh my memory. Or, you can take a
> >>>> look at the (quite complex) JSON parser, or the EVF itself to see what
> >>>> it does. There are many unit tests that show this at various levels of
> >>>> abstraction.
> >>>>
> >>>> Basically, you have to (see the sketch after this list):
> >>>>
> >>>> * Start a batch
> >>>> * Ask if you can start the next record (which might be declined if the
> >>>> batch is full)
> >>>> * Write each field. For complex fields, such as records, recursively
> >>>> do the start/end record work.
> >>>> * Mark the record as complete.
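> >>>>
> >>>> In code, that loop looks roughly like this (a sketch; method names
> >>>> as I recall them, and dfdlHasMoreData()/writeFields() are
> >>>> placeholders for the DFDL side):
> >>>>
> >>>> RowSetLoader rowWriter = loader.writer();
> >>>> while (!rowWriter.isFull()) {      // stop once the batch is full
> >>>>   if (!dfdlHasMoreData()) break;   // hypothetical end-of-input check
> >>>>   rowWriter.start();               // start the next record
> >>>>   writeFields(rowWriter);          // push DFDL events into writers
> >>>>   rowWriter.save();                // mark the record as complete
> >>>> }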
> >>>>
> >>>> You should be able to map event handlers to EVF actions as a result.
> >>>> Even though DFDL wants to "drive", it still has to give up control once
> >>>> the batch is full. EVF will then handle the (surprisingly complex) task
> >>>> of finishing up the batch and returning it as the output of the Scan
> >>>> operator.
> >>>>
> >>>> - Paul
> >>>>
> >>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
> >>>>
> >>>>> Daffodil parsing generates event callbacks to an InfosetOutputter,
> >>>>> which is analogous to a SAX event handler.
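> >>>>>
> >>>>> For concreteness, the callback side looks like this (a sketch; the
> >>>>> class name is hypothetical and exact method signatures vary by
> >>>>> Daffodil version):
> >>>>>
> >>>>> import org.apache.daffodil.japi.infoset.InfosetOutputter;
> >>>>> import org.apache.daffodil.japi.infoset.InfosetSimpleElement;
> >>>>>
> >>>>> // Receives Daffodil parse events and writes into Drill vectors.
> >>>>> public class DrillInfosetOutputter extends InfosetOutputter {
> >>>>>   public void startDocument() { /* begin a row */ }
> >>>>>   public void endDocument()   { /* finish the row */ }
> >>>>>   public void startSimple(InfosetSimpleElement e) {
> >>>>>     /* write e's value to the matching column writer */
> >>>>>   }
> >>>>>   // startComplex/endComplex -> enter/leave a map writer;
> >>>>>   // startArray/endArray -> enter/leave a repeated-type writer;
> >>>>>   // remaining abstract callbacks elided here.
> >>>>> }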
> >>>>>
> >>>>> Drill is expecting an iterator style of calling next() to advance
> >>>>> through the input, i.e., Drill has the control thread and expects to
> >>>>> do pull parsing. At least from the code I studied in the format-xml
> >>>>> contrib.
> >>>>>
> >>>>> Is there any alternative, before I dig into creating another one of
> >>>>> these co-routine-style control inversions (which have proven to be
> >>>>> problematic for performance)?
>
>
