mbeckerle commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1878896878

   > @mbeckerle I had a thought about your TODO list. See inline.
   > 
   > > This is ready for a next review. All the scalar types are now 
implemented with typed setter calls.
   > > The prior review comments have all been addressed I believe.
   > > Remaining things to do include:
   > > 
   > > 1. How to get the compiled DFDL schema object so it can be loaded by 
daffodil out at the distributed Drill nodes.
   > 
   > I was thinking about this and I remembered something that might be useful. 
Drill has support for User Defined Functions (UDF) which are written in Java. 
To add a UDF to Drill, you also have to write some Java classes in a particular 
way, and include the JARs. Much like the DFDL class files, the UDF JARs must be 
accessible to all nodes of a Drill cluster.
   > 
   > Additionally, Drill has the capability of adding UDFs dynamically. This 
feature was added here: #574. Anyway, I wonder if we could use a similar 
mechanism to load and store the DFDL files so that they are accessible to all 
Drill nodes. What do you think?
   
   Excellent: so Drill already has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs, but also very similar.
   
   There are two user scenarios which we can call production and test.
   
   1. Production: a binary compiled DFDL schema file + code jars for Daffodil's own UDFs and "layers" plugins. Ideally this caches the compiled schema rather than reloading it for every query (at every node), keeping the same loaded instance in memory in a persistent JVM on each node (see the cache sketch just after this list). For large production DFDL schemas this is the only sensible mechanism, as compiling a large DFDL schema can take minutes.
   
   2. Test: on-the-fly centralized compilation of the DFDL schema (from a combination of jars and files) to create and cache (to avoid recompiling) the binary compiled DFDL schema file. That compiled binary file is then used as in item 1. For small DFDL schemas this can be fast enough for production use. Ideally, if the DFDL schema is unchanged this would reuse the compiled binary file, but that's an optimization that may not matter much.
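   
   Just to make the per-node caching idea in item 1 concrete, here is a minimal sketch (not actual plugin code) of reloading a compiled DFDL schema at most once per JVM and sharing the read-only processor across queries. It assumes Daffodil's documented Java API (org.apache.daffodil.japi); exact names may need adjusting to the Daffodil version in use, and the path-keyed cache is purely illustrative.
   
   ```java
   import java.io.File;
   import java.util.concurrent.ConcurrentHashMap;
   
   import org.apache.daffodil.japi.Daffodil;
   import org.apache.daffodil.japi.DataProcessor;
   
   /** Per-JVM cache of reloaded compiled DFDL schemas (illustrative only). */
   public final class CompiledSchemaCache {
     private static final ConcurrentHashMap<String, DataProcessor> CACHE =
         new ConcurrentHashMap<>();
   
     /** Reload the saved binary schema at most once per JVM and reuse it. */
     public static DataProcessor get(String savedSchemaPath) {
       return CACHE.computeIfAbsent(savedSchemaPath, path -> {
         try {
           // Reloading a pre-compiled binary is fast compared to compiling
           // the DFDL schema from source, and the resulting DataProcessor is
           // read-only, so it can be shared across threads and queries.
           return Daffodil.compiler().reload(new File(path));
         } catch (Exception e) {
           throw new RuntimeException("Could not reload " + path, e);
         }
       });
     }
   
     private CompiledSchemaCache() { }
   }
   ```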
   
   Kinds of objects involved are:
   
   - Daffodil plugin code jars
   - DFDL schema jars
   - DFDL schema files (just not packaged into a jar)
   - Daffodil compiled schema binary file
   - Daffodil config file - parameters, tunables, and options needed at compile 
time and/or runtime
   
   Code jars: Daffodil provides two extension features for DFDL users: DFDL UDFs and DFDL 'layers' (e.g., plug-ins for uudecode or gunzip algorithms used in part of the data format). These are ordinary compiled class files packaged in jars, so in all scenarios those jars are needed on the node classpath if the DFDL schema uses them. Daffodil dynamically finds and loads them from the classpath using the regular Java Service Provider Interface (SPI) mechanism.
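   
   As a generic illustration of that SPI lookup pattern (the interface name below is hypothetical, not Daffodil's actual provider interface), the provider jar ships a META-INF/services entry and the host discovers implementations on the classpath with java.util.ServiceLoader:
   
   ```java
   import java.util.ServiceLoader;
   
   // Hypothetical provider interface, standing in for Daffodil's own UDF/layer
   // provider interfaces. A provider jar would include a text file
   //   META-INF/services/MyExtensionProvider
   // listing its implementation class names.
   interface MyExtensionProvider {
     String name();
   }
   
   public class SpiDiscoveryExample {
     public static void main(String[] args) {
       // ServiceLoader scans the classpath and instantiates each registered
       // implementation; this is why the code jars only need to be on the
       // node's classpath for the extensions to be found at runtime.
       for (MyExtensionProvider p : ServiceLoader.load(MyExtensionProvider.class)) {
         System.out.println("Found extension provider: " + p.name());
       }
     }
   }
   ```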
   
   Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files so that inter-schema dependencies can be managed as ordinary jar/Java-style dependencies. Tools like sbt and Maven can express the dependency of one schema on another, fetch them, pull them together, etc. Daffodil has a resolver, so when one schema file references another with include/import, it searches the classpath directories and jars for the files.
   
   Schema jars are only needed centrally, when compiling the schema to a binary file. All inter-schema file references (including those into jars) are resolved at compile time and baked into the compiled binary file.
   
   It is possible for one DFDL schema 'project' to define a DFDL schema along with the code for a plugin like a Daffodil UDF or layer. In that case the one jar created is both a code jar and a schema jar. The schema-jar aspects are used when the schema is compiled and ignored at Daffodil runtime; the code-jar aspects are used at Daffodil runtime and ignored at schema compilation time. So a jar that is both a code jar and a schema jar needs to be on the classpath in both places, but there is no interaction between the two roles.
   
   Binary Compiled Schema File: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object, which can be reloaded in order to actually use the schema to parse/unparse data. (A sketch of this compile/save/reload cycle follows the bullets below.)
   
   - These binary files are tied to a specific version+build of Daffodil. (They are just a Java object serialization of the runtime data structures used by Daffodil.)
   - Once reloaded into a JVM to create a Daffodil DataProcessor object, that object is read-only, hence thread-safe, and can be shared by parse calls happening on many threads.
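   
   For reference, here is a rough sketch of that compile / save / reload cycle using Daffodil's Java API (package and method names as documented for recent Daffodil 3.x releases; treat the details as approximate, and the file names as placeholders):
   
   ```java
   import java.io.File;
   import java.io.FileOutputStream;
   import java.nio.channels.Channels;
   
   import org.apache.daffodil.japi.Compiler;
   import org.apache.daffodil.japi.Daffodil;
   import org.apache.daffodil.japi.DataProcessor;
   import org.apache.daffodil.japi.ProcessorFactory;
   
   public class CompileSaveReloadExample {
     public static void main(String[] args) throws Exception {
       Compiler c = Daffodil.compiler();
   
       // Central, possibly slow step: compile the DFDL schema from source.
       ProcessorFactory pf = c.compileFile(new File("mySchema.dfdl.xsd"));
       if (pf.isError()) {
         pf.getDiagnostics().forEach(d -> System.err.println(d));
         return;
       }
       DataProcessor dp = pf.onPath("/");
   
       // Save the compiled schema as a binary file; this file is tied to the
       // exact Daffodil version/build that produced it.
       try (FileOutputStream fos = new FileOutputStream("mySchema.bin")) {
         dp.save(Channels.newChannel(fos));
       }
   
       // On each Drill node: reload the binary (fast) to get a read-only,
       // thread-safe DataProcessor that many parse calls can share.
       DataProcessor reloaded = c.reload(new File("mySchema.bin"));
       System.out.println("Reloaded compiled schema: " + reloaded);
     }
   }
   ```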
   
   Daffodil Config File: This contains settings such as which warnings to suppress when compiling and/or at runtime, plus tunables such as how large a regex match attempt to allow, the maximum parsed data size limit, etc. It is needed both at schema compile time and at runtime, since the same file contains parameters for both.
   
   
   
   

