Fwd: [apache/drill] WIP: Preliminary Review on adding Daffodil to Drill (PR #2836)

2023-10-13 Thread Mike Beckerle
My PR needs input from Drill developers.

Please look for TODO and FIXME in this PR and help me get to the point
where I can initialize this plugin.

In general I copied things from the format-xml contrib module, but then
took ideas from JSON. I was unable to figure out how initialization works
from the Excel plugin.

The metadata bridge is here, along with a stub of the data bridge. The stub
handles only the simple type "INT" right now, and of course it doesn't
compile yet.

https://github.com/apache/drill/pull/2836


-- Forwarded message -
From: Mike Beckerle 
Date: Fri, Oct 13, 2023 at 11:11 PM
Subject: [apache/drill] WIP: Preliminary Review on adding Daffodil to Drill
(PR #2836)
To: apache/drill 
Cc: Mike Beckerle , Your activity <
your_activ...@noreply.github.com>


DRILL-2835: Preliminary Review on adding Daffodil to Drill

Description

New format-daffodil module created, but I need assistance with several
aspects.

Tests of creating Drill schemas from DFDL are working. They're simple, but
they show promise.

There are major TODO/FIXME/TBDs in here. Search for FIXME, and "Then a
MIRACLE occurs..."

This does not compile yet because I have not worked out how the plugin
system initializes things. That is the main open problem blocking a clean
compile.

Needs review by Drill-devs.

Documentation

TBD: This will require doc eventually.

Testing

Needs more. This is just a preliminary design review of a work in progress.
--
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/drill/pull/2836

Commit Summary

   - 0633fdb Checkpoint on adding Daffodil to Drill

File Changes

(25 files)

   - *A* contrib/format-daffodil/.gitignore (2)
   - *A* contrib/format-daffodil/README.md (41)
   - *A* contrib/format-daffodil/pom.xml (89)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java (180)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilDrillInfosetOutputter.java (105)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilFormatConfig.java (97)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilFormatPlugin.java (87)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilMessageParser.java (187)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DaffodilDataProcessorFactory.java (130)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DrillDaffodilSchemaUtils.java (107)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DrillDaffodilSchemaVisitor.java (121)
   - *A* contrib/format-daffodil/src/main/resources/bootstrap-format-plugins.json (26)
   - *A* contrib/format-daffodil/src/main/resources/drill-module.conf (25)
   - *A* contrib/format-daffodil/src/test/java/org/apache/drill/exec/store/daffodil/TestDaffodilReader.java


[PR] WIP: Preliminary Review on adding Daffodil to Drill (drill)

2023-10-13 Thread via GitHub


mbeckerle opened a new pull request, #2836:
URL: https://github.com/apache/drill/pull/2836






Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-13 Thread Mike Beckerle
Very helpful.

Answers to your questions, and comments are below:

On Thu, Oct 12, 2023 at 5:14 PM Charles Givre  wrote:

> Hi Mike,
> I hope all is well.  I'll take a stab at answering your questions.  But I
> have a few questions as well:
>
>
> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was
> that this would be a format plugin, but let me know if you were thinking
> differently
>

Format plugin.


> 2.  In traditional deployments, where do people store the DFDL schemata
> files?  Are they local or accessible via URL?
>

Schemas are stored in files, or in jar files created when packaging a
schema project. Hence URI is the preferred identifier for them.  They are
not retrieved remotely or anything like that. It's a matter of whether they
are in jars on the classpath, directories on the classpath, or just a file
location.

DFDL schema source code is often built using other schemas as components,
so a single "DFDL schema" may have parts that come from 5 jar files on the
classpath, e.g., 2 different header schemas, a library schema, and the
"main" schema that assembles them all.  Inside schemas they refer to each
other via xs:include or xs:import. The schemaLocation attribute takes a URI
to the location of the included/imported schema, and those URIs are
interpreted in the same way we would want Drill to identify the location of
a schema.

However, people will really want to pre-compile any real (non-toy,
non-test) DFDL schemas into binary ".bin" files for faster loading.
Otherwise Daffodil schema compilation time can be excessive: minutes for
large DFDL schemas. For example, the DFDL schema for VMF is 180K lines of
DFDL. A compiled schema lives in exactly 1 file, which is relatively small
(the compiled form of the VMF schema is 8 MB). So the schema path given in
a Drill SQL query, or in the config, should be allowed to be either a
compiled schema or a source-code schema (.xsd), the latter mostly being for
tests, training, and toy examples that we would compile on the fly.
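
A minimal sketch of that either/or handling, assuming the two forms can be
told apart by file extension (".bin" for compiled, ".xsd" for source). The
names SchemaKind and classify are hypothetical; they are not part of Drill
or Daffodil, and the thread does not settle how the plugin will actually
distinguish the two.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: decides how a schema URI should be loaded. */
enum SchemaKind {
    COMPILED, // pre-compiled binary schema, loaded as-is
    SOURCE;   // DFDL source schema (.xsd), compiled on the fly

    /**
     * Classifies a schema URI by its extension. Extension-based dispatch is
     * an assumption made for illustration only.
     */
    static SchemaKind classify(URI schemaUri) {
        // toString() is used instead of getPath() because jar: URIs are
        // opaque and getPath() returns null for them.
        String text = schemaUri.toString().toLowerCase(Locale.ROOT);
        if (text.endsWith(".bin")) {
            return COMPILED;
        }
        if (text.endsWith(".xsd")) {
            return SOURCE;
        }
        throw new IllegalArgumentException("Unrecognized schema type: " + schemaUri);
    }
}
```

Under this assumption the same config string or table-function parameter
can point at either form, and the reader picks the loading path from the
result.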


> To get the DFDL schema file or URL we have a few options, all of which
> revolve around setting a config variable.  For now, let's just say that the
> schema file is contained in the same folder as the data.  (We can make this
> more sophisticated later...)
>

It would make life difficult if the schemas and test data had to be
co-resident. Most schema projects keep these in entirely separate
sub-trees. Schema will be under src/main/resources//xsd, compiled
schema would be under target/... and test data under
src/test/resources/.../data.

For now I think the easiest thing is for us to just take two URIs: one for
the data, one for the schema. We access them via getClass().getResource().
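
A minimal sketch of the two-URI lookup, using only the JDK. The class name
ClasspathResources and the method resolve are hypothetical, and the
fail-fast behavior on a missing resource is an assumption rather than
anything decided in this thread.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical helper for locating a schema or data file on the classpath. */
final class ClasspathResources {
    private ClasspathResources() {}

    /**
     * Resolves a classpath resource name to a URI. Class.getResource returns
     * null when nothing is found, so that case is turned into an exception.
     */
    static URI resolve(Class<?> anchor, String resourceName) {
        URL url = anchor.getResource(resourceName);
        if (url == null) {
            throw new IllegalArgumentException(
                "Resource not found on classpath: " + resourceName);
        }
        try {
            return url.toURI();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(
                "Resource URL is not a valid URI: " + url, e);
        }
    }
}
```

A test would call resolve twice, once with the schema's resource name and
once with the data file's, which keeps the two in separate sub-trees.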

We should not worry about caching or anything for now. Once the above works
for a decent scope of tests we can worry about making it more convenient to
have a library of schemas at one's disposal.


>
> Here's what you have to do.
>
> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
> Note... config variables must be private and final.  If they aren't it can
> cause weird errors that are really difficult to debug.  For some reference,
> take a look at the Excel plugin.  (
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
> )
>
> Setting a config variable there will allow a user to set a global schema
> definition.  This can also be configured individually for various
> workspaces.  So let's say you had PCAP files in one workspace, you could
> globally set the DFDL file for that and then another workspace which has
> some other file, you could create another DFDL plugin instance for that.
>

Ok, so the above lets me play with Drill and one schema by default. Ok for
using Drill to explore data, and useful for testing.
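
A minimal sketch of the config shape described in the quoted advice,
written so it compiles without Drill on the classpath. The class name
DaffodilFormatConfigSketch is hypothetical. A real Drill format config
would implement Drill's FormatPluginConfig and carry Jackson annotations
for serialization, as the linked ExcelFormatConfig does; those parts are
left out here and only noted in comments.

```java
import java.util.Objects;

/**
 * Stand-in for a Drill format config. Not the class from the PR; the Drill
 * interface and Jackson annotations are deliberately omitted.
 */
final class DaffodilFormatConfigSketch {
    // Config variables are private and final, per the advice above.
    private final String dfdlSchema;

    // In a real config this constructor would be the annotated JSON creator.
    DaffodilFormatConfigSketch(String dfdlSchema) {
        this.dfdlSchema = dfdlSchema;
    }

    String getDfdlSchema() {
        return dfdlSchema;
    }

    // Value-based equality, so two configs naming the same schema compare equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof DaffodilFormatConfigSketch)) {
            return false;
        }
        return Objects.equals(dfdlSchema, ((DaffodilFormatConfigSketch) other).dfdlSchema);
    }

    @Override
    public int hashCode() {
        return Objects.hash(dfdlSchema);
    }
}
```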


>
> Now, this is all fine and good, but a user might also want to define the
> schema file at query time.  The good news is that Drill allows you to do
> that via the table() function.
>
>
This would allow real data-integration queries against multiple different
DFDL-described data sources. Needed for a compelling demo.


> So let's say that we want to use a different schema file than the default,
> we could do something like this:
>
> SELECT 
> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
> dfdlSchema=>'path_to_schema'))
>
> Take a look at the Excel docs (
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md)
> which demonstrate how to write queries like that.  I believe that the
> parameters in the table function take higher precedence than the parameters
> from the config.  That would make sense at least.
>
>
Perfect. I'll start with this.
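
The precedence rule suggested above (a table-function parameter wins over
the configured default) can be illustrated with a small sketch. This only
illustrates the rule and is not a depiction of Drill's internals; the class
SchemaSelection and the method effectiveSchema are hypothetical, and the
precedence itself is stated in the quoted message as a belief, not a
confirmed fact.

```java
import java.util.Map;

/** Hypothetical illustration of "query-time parameter overrides config". */
final class SchemaSelection {
    private SchemaSelection() {}

    /**
     * Returns the schema named in the table-function parameters if one was
     * given, otherwise the default from the format config.
     */
    static String effectiveSchema(String configDefault,
                                  Map<String, String> tableFunctionParams) {
        String override = tableFunctionParams.get("dfdlSchema");
        return override != null ? override : configDefault;
    }
}
```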


>
> 2.  Now that we have the schema file, the next thing would be to convert
> that into a Drill schema.  Let's say that we have a function called
> dfdlToDrill that handles the conversion.
>
> What you'd have to do is in the constructor for t