Hi Mike, 
I hope all is well. I remembered one other piece which might be useful for 
you. Drill has an interface called PersistentStore which is used for storing 
artifacts such as tokens. I've used it on two occasions: in the GoogleSheets 
plugin and the HTTP plugin. In both cases, I used it to store OAuth user 
tokens which need to be preserved and shared across drillbits, and also 
frequently updated. I was thinking that this might be useful for caching 
the DFDL schemata. If you take a look here:

https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java
https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java

you can see how I used it.
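
For illustration only, here is a rough sketch of how a compiled-schema cache 
might sit on top of that interface. The store-construction calls mirror how 
the OAuth token stores are built, but treat the exact names as assumptions 
and check the linked classes for the real API:

// Hypothetical sketch: cache compiled DFDL schemas in a PersistentStore
// (org.apache.drill.exec.store.sys), keyed by schema URI, so the compiled
// form can be shared across drillbits. Names are from memory; verify them
// against AccessTokenRepository and the oauth package linked above.
PersistentStoreConfig<String> cacheConfig = PersistentStoreConfig
    .newJacksonBuilder(mapper, String.class)   // Jackson-serialized values
    .name("dfdl_schema_cache")                 // hypothetical store name
    .build();
PersistentStore<String> schemaCache = storeProvider.getOrCreateStore(cacheConfig);

// Look up a cached (e.g. base64-encoded) compiled schema; compile and store on a miss.
String key = dfdlSchemaUri.toString();
String compiled = schemaCache.get(key);
if (compiled == null) {
  compiled = compileAndEncodeSchema(dfdlSchemaUri);  // hypothetical helper
  schemaCache.put(key, compiled);
}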

Best,
-- C

  




> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mbecke...@apache.org> wrote:
> 
> Very helpful.
> 
> Answers to your questions, and comments are below:
> 
> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgi...@gmail.com> wrote:
>> Hi Mike, 
>> I hope all is well.  I'll take a stab at answering your questions.  But I 
>> have a few questions as well:
>>  
>> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was 
>> that this would be a format plugin, but let me know if you were thinking 
>> differently.
> 
> Format plugin.
>  
>> 2.  In traditional deployments, where do people store the DFDL schemata 
>> files?  Are they local or accessible via URL?
> 
> Schemas are stored in files, or in jar files created when packaging a schema 
> project, hence a URI is the preferred identifier for them. They are not 
> retrieved remotely or anything like that; it's a matter of whether they are 
> in jars on the classpath, directories on the classpath, or just a file 
> location. 
> 
> The source code of DFDL schemas is often created using other schemas as 
> components, so a single "DFDL schema" may have parts that come from 5 jar 
> files on the classpath, e.g., 2 different header schemas, a library schema, 
> and the "main" schema that assembles them all. Schemas refer to each other 
> via xs:include or xs:import, where the schemaLocation attribute takes a URI 
> to the location of the included/imported schema, and those URIs are 
> interpreted in the same way we would want Drill to identify the location of 
> a schema. 
> 
> In practice, though, people will want to pre-compile any real (non-toy/test) 
> DFDL schemas into binary ".bin" files for faster loading. Otherwise, Daffodil 
> schema compilation time can be excessive (minutes for large DFDL schemas; for 
> example, the DFDL schema for VMF is 180K lines of DFDL). Compiled schemas 
> live in exactly one file, which is relatively small (the compiled form of the 
> VMF schema is 8 MB). So the schema path given in a Drill SQL query, or in the 
> config, should be allowed to be either a compiled schema or a source-code 
> schema (.xsd), the latter mostly for test, training, and toy examples that we 
> would compile on the fly.  
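> 
> For reference, the plugin-side handling might look roughly like this with the 
> Daffodil Java API (a sketch only; check the exact signatures against the 
> org.apache.daffodil.japi docs):
> 
> // Sketch: compile .xsd source schemas on the fly, reload pre-compiled .bin files.
> Compiler compiler = Daffodil.compiler();
> DataProcessor dp;
> if (schemaUri.getPath().endsWith(".bin")) {
>   // Reload a pre-compiled schema (fast)
>   dp = compiler.reload(new File(schemaUri));
> } else {
>   // Compile the .xsd source schema (can take minutes for large schemas)
>   ProcessorFactory pf = compiler.compileSource(schemaUri);
>   if (pf.isError()) { /* report pf.getDiagnostics() */ }
>   dp = pf.onPath("/");
> }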
>  
>> To get the DFDL schema file or URL we have a few options, all of which 
>> revolve around setting a config variable.  For now, let's just say that the 
>> schema file is contained in the same folder as the data.  (We can make this 
>> more sophisticated later...)
> 
> It would make life difficult if the schemas and test data must be 
> co-resident. Most schema projects have these in entirely separate sub-trees. 
> Schema will be under src/main/resources/..../xsd, compiled schema would be 
> under target/... and test data under src/test/resources/.../data
> 
> For now, I think the easiest thing is just to take two URIs: one for the 
> data, one for the schema. We access them via getClass().getResource(). 
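> 
> Something like this, just to illustrate the access pattern (the resource 
> paths here are made up):
> 
> URI schemaUri = getClass().getResource("/xsd/myFormat.dfdl.xsd").toURI();
> URI dataUri = getClass().getResource("/data/test1.dat").toURI();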
> 
> We should not worry about caching or anything for now. Once the above works 
> for a decent scope of tests we can worry about making it more convenient to 
> have a library of schemas at one's disposal. 
>  
>> 
>> Here's what you have to do.
>> 
>> 1.  In the format config class, define a String called 'dfdlSchema'.  Note: 
>> config variables must be private and final.  If they aren't, it can cause 
>> weird errors that are really difficult to debug.  For some reference, take a 
>> look at the Excel plugin.  
>> (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
>> 
>> Setting a config variable there will allow a user to set a global schema 
>> definition.  This can also be configured individually for various 
>> workspaces.  So let's say you had PCAP files in one workspace: you could set 
>> the DFDL file globally for that workspace, and for another workspace 
>> containing some other file type, you could create another DFDL plugin 
>> instance. 
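>> 
>> A bare-bones version of that config class might look something like the 
>> sketch below (class and field names here are illustrative, not the actual 
>> plugin code; equals/hashCode omitted for brevity):
>> 
>> @JsonTypeName("dfdl")
>> public class DFDLFormatConfig implements FormatPluginConfig {
>>   private final String dfdlSchema;        // path/URI of the DFDL schema
>>   private final List<String> extensions;  // file extensions this plugin handles
>> 
>>   @JsonCreator
>>   public DFDLFormatConfig(@JsonProperty("dfdlSchema") String dfdlSchema,
>>                           @JsonProperty("extensions") List<String> extensions) {
>>     this.dfdlSchema = dfdlSchema;
>>     this.extensions = extensions == null
>>         ? Collections.singletonList("dat")  // hypothetical default extension
>>         : extensions;
>>   }
>> 
>>   public String getDfdlSchema() { return dfdlSchema; }
>>   public List<String> getExtensions() { return extensions; }
>> }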
> 
> Ok, so the above lets me play with Drill and one schema by default. Ok for 
> using Drill to explore data, and useful for testing. 
>  
>> 
>> Now, this is all fine and good, but a user might also want to define the 
>> schema file at query time.  The good news is that Drill allows you to do 
>> that via the table() function. 
>> 
> 
> This would allow real data-integration queries against multiple different 
> DFDL-described data sources. Needed for a compelling demo. 
>  
>> So let's say that we want to use a different schema file than the default, 
>> we could do something like this:
>> 
>> SELECT ....
>> FROM table(dfs.dfdl_workspace.`myfile` (type => 'dfdl', 
>>   dfdlSchema => 'path_to_schema'))
>> 
>> Take a look at the Excel docs 
>> (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) 
>> which demonstrate how to write queries like that.  I believe that the 
>> parameters in the table function take higher precedence than the parameters 
>> from the config.  That would make sense at least.
>> 
> 
> Perfect. I'll start with this. 
>  
>> 
>> 2.  Now that we have the schema file, the next thing would be to convert 
>> that into a Drill schema.  Let's say that we have a function called 
>> dfdlToDrill that handles the conversion.
>> 
>> What you'd have to do is set the schema in the constructor for the 
>> BatchReader.  So, in pseudocode:
>> 
>> public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan, 
>>     FileSchemaNegotiator negotiator) {
>>      // Other stuff...
>>      
>>      // Get the Drill schema from the DFDL schema
>>      TupleMetadata schema = dfdlToDrill(dfdlSchemaFile);
>>      
>>      // Here's the important part
>>      negotiator.tableSchema(schema, true);
>> }
>> 
>> The negotiator.tableSchema() accepts two args: a TupleMetadata and a boolean 
>> indicating whether the schema is final.  Once this schema has been added 
>> to the negotiator object, you can then create the writers. 
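>> 
>> Roughly, the writer creation follows the sketch below (illustrative only; 
>> see the Excel reader linked below for the real pattern):
>> 
>> // After registering the schema, build the loader and get the row writer.
>> negotiator.tableSchema(schema, true);
>> ResultSetLoader loader = negotiator.build();
>> RowSetLoader rowWriter = loader.writer();
>> 
>> // Then, per record:
>> rowWriter.start();
>> rowWriter.scalar("someIntField").setInt(42);  // column name is illustrative
>> rowWriter.save();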
>> 
> 
> That negotiator.tableSchema() is ideal. I was hoping that this was going to 
> be the only place the metadata had to be given to Drill. Excellent. 
>  
>> 
>> Take a look here:
>> 
>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>> 
>> 
>> I see Paul just responded so I'll leave you with this.  If you have 
>> additional questions, send them our way.  Do take a look at the Excel plugin 
>> as I think it will be helpful.
>> 
> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can work 
> similarly.
> 
> This will take me a few more days to get to a pull request. The first one 
> will be for initial review, i.e., not intended to merge without more tests. 
> It will probably support only integer data fields, but it should support lots 
> of data shapes including vectors, choices, sequences, nested records, etc. 
> 
> Thanks for the help. 
>  
>> 
>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>> 
>>> So when a data format is described by a DFDL schema, I can generate an
>>> equivalent Drill schema (TupleMetadata). This schema is always complete. I
>>> have unit tests working with this.
>>> 
>>> To do this for a real SQL query, I need the DFDL schema to be identified on
>>> the SQL query by a file path or URI.
>>> 
>>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>>> 
>>> Next, assuming I have the DFDL schema identified, I generate an equivalent
>>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>>> 
>>> What objects do I call, or what classes do I have to create to make this
>>> Drill TupleMetadata available to Drill so it uses it in all the ways a
>>> static Drill schema can be useful?
>>> 
>>> I just need pointers to the code that illustrate how to do this. Thanks
>>> 
>>> -Mike Beckerle
>>> 
>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>>> 
>>>> Mike,
>>>> 
>>>> This is a complex question and has two answers.
>>>> 
>>>> First, the standard enhanced vector framework (EVF) used by most readers
>>>> assumes a "pull" model: read each record. This is where the next() comes
>>>> in: readers just implement this to read the next record. But the code
>>>> under EVF works with a push model: the readers write to vectors and signal
>>>> the next record. EVF translates the lower-level push model to the
>>>> higher-level, easier-to-use pull model. The best example of this is the
>>>> JSON reader which uses Jackson to parse JSON and responds to the
>>>> corresponding events.
>>>> 
>>>> You can thus take over the task of filling a batch of records. I'd have to
>>>> poke around the code to refresh my memory. Or, you can take a look at the
>>>> (quite complex) JSON parser, or the EVF itself to see what it does. There
>>>> are many unit tests that show this at various levels of abstraction.
>>>> 
>>>> Basically, you have to:
>>>> 
>>>> * Start a batch
>>>> * Ask if you can start the next record (which might be declined if the
>>>> batch is full)
>>>> * Write each field. For complex fields, such as records, recursively do the
>>>> start/end record work.
>>>> * Mark the record as complete.
>>>> 
>>>> You should be able to map event handlers to EVF actions as a result. Even
>>>> though DFDL wants to "drive", it still has to give up control once the
>>>> batch is full. EVF will then handle the (surprisingly complex) task of
>>>> finishing up the batch and returning it as the output of the Scan operator.
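>>>> 
>>>> In rough pseudo-Java, the mapping might look like the sketch below. The
>>>> Drill-side calls follow the EVF row-writer API (start/save/scalar/tuple);
>>>> the event-handler names only loosely mirror Daffodil's outputter callbacks
>>>> and are illustrative, not the real InfosetOutputter signatures:
>>>> 
>>>> // Sketch: a Daffodil-event handler that writes into EVF's row writer.
>>>> class DrillInfosetOutputter /* would extend Daffodil's InfosetOutputter */ {
>>>>   private final RowSetLoader rowWriter;
>>>> 
>>>>   DrillInfosetOutputter(RowSetLoader rowWriter) { this.rowWriter = rowWriter; }
>>>> 
>>>>   void startRecord() { rowWriter.start(); }      // start of a top-level record
>>>>   void endRecord()   { rowWriter.save(); }       // mark the record complete
>>>> 
>>>>   void simpleIntField(String name, int value) {  // a simple (leaf) field
>>>>     rowWriter.scalar(name).setInt(value);
>>>>   }
>>>> 
>>>>   TupleWriter startNestedRecord(String name) {   // recurse into a complex field
>>>>     return rowWriter.tuple(name);
>>>>   }
>>>> 
>>>>   boolean batchFull() {                          // when true, pause and hand off
>>>>     return rowWriter.isFull();
>>>>   }
>>>> }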
>>>> 
>>>> - Paul
>>>> 
>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
>>>> 
>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is
>>>>> analogous to a SAX event handler.
>>>>> 
>>>>> Drill is expecting an iterator style of calling next() to advance through
>>>>> the input, i.e., Drill has the control thread and expects to do pull
>>>>> parsing. At least from the code I studied in the format-xml contrib.
>>>>> 
>>>>> Is there any alternative, before I dig into creating another one of these
>>>>> co-routine-style control inversions (which have proven to be problematic
>>>>> for performance)?
