mbeckerle commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1878896878
> @mbeckerle I had a thought about your TODO list. See inline.
>
> > This is ready for a next review. All the scalar types are now implemented with typed setter calls. The prior review comments have all been addressed I believe.
> >
> > Remaining things to do include:
> >
> > 1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.
>
> I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster.
>
> Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?

Excellent: so Drill has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs, but also very similar.

There are two user scenarios, which we can call production and test:

1. Production: a binary compiled DFDL schema file, plus code jars for Daffodil's own UDFs and "layer" plugins. Ideally this should cache the compiled schema rather than reload it for every query (at every node), keeping the same loaded instance in memory in a persistent JVM image on each node. For large production DFDL schemas this is the only sensible mechanism, since compiling a large DFDL schema can take minutes.
2. Test: on-the-fly centralized compilation of the DFDL schema (from a combination of jars and files) to create and cache (to avoid recompiling) the binary compiled DFDL schema file, which is then used as in item 1. For small DFDL schemas this can be fast enough for production use. Ideally, if the DFDL schema is unchanged this would reuse the compiled binary file, but that's an optimization that may not matter much.

The kinds of objects involved are:

- Daffodil plugin code jars
- DFDL schema jars
- DFDL schema files (i.e., not packaged into a jar)
- Daffodil compiled schema binary file
- Daffodil config file: parameters, tunables, and options needed at compile time and/or runtime

Code jars: Daffodil provides two extension features for DFDL users, DFDL UDFs and DFDL 'layers' (e.g., plug-ins for the uudecode or gunzip algorithms used in part of a data format). These are ordinary compiled class files in jars, so in all scenarios those jars are needed on the node class path if the DFDL schema uses them. Daffodil finds and loads them dynamically from the classpath via the regular Java Service Provider Interface (SPI) mechanism.

Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files so that inter-schema dependencies can be managed as ordinary jar/Java-style managed dependencies. Tools like sbt and maven can express the dependency of one schema on another, fetch them, pull them together, etc. Daffodil has a resolver, so when one schema file references another with include/import, it searches the class path directories and jars for the files. Schema jars are only needed centrally, when compiling the schema to a binary file; all references to the jar files for inter-schema file references are compiled into the compiled binary file.
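For concreteness, here is a minimal sketch of that centralized compile-and-save step using Daffodil's Java API (`org.apache.daffodil.japi`). The schema URI and output path are placeholders, not anything this PR defines; it just illustrates what "compile once, save the binary" looks like:

```java
import java.io.FileOutputStream;
import java.net.URI;
import java.nio.channels.Channels;

import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.Diagnostic;
import org.apache.daffodil.japi.ProcessorFactory;

public class CompileAndSaveDfdlSchema {

  public static void main(String[] args) throws Exception {
    // Schema source: a plain file here, but it could equally be resolved from a
    // schema jar on the class path. The path is a placeholder.
    URI schemaUri = new URI("file:///schemas/mySchema.dfdl.xsd");

    Compiler compiler = Daffodil.compiler();
    ProcessorFactory pf = compiler.compileSource(schemaUri);
    if (pf.isError()) {
      for (Diagnostic d : pf.getDiagnostics()) {
        System.err.println(d.getMessage());
      }
      return;
    }

    DataProcessor dp = pf.onPath("/");

    // Save the compiled schema as the reloadable binary file discussed above.
    // The file is tied to the exact Daffodil version/build that produced it.
    try (FileOutputStream fos = new FileOutputStream("/tmp/mySchema.dfdl.bin")) {
      dp.save(Channels.newChannel(fos));
    }
  }
}
```

In the test scenario this step would run centrally and its result would be cached; in the production scenario it happens ahead of time, and only the resulting binary file (plus any plugin code jars) needs to be distributed to the nodes.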
It is possible for one DFDL schema 'project' to define a DFDL schema along with the code for a plugin such as a Daffodil UDF or layer. In that case the single jar created is both a code jar and a schema jar. The schema-jar aspects are used when the schema is compiled and ignored at Daffodil runtime; the code-jar aspects are used at Daffodil runtime and ignored at schema compilation time. So a jar that is both a code jar and a schema jar needs to be on the class path in both places, but there is no interaction between the two roles.

Binary Compiled Schema File: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object, which can be reloaded in order to actually use the schema to parse/unparse data.

- These binary files are tied to a specific version+build of Daffodil. (They are just a Java object serialization of the runtime data structures used by Daffodil.)
- Once reloaded into a JVM to create a Daffodil DataProcessor object, that object is read-only and therefore thread-safe, and can be shared by parse calls happening on many threads (sketched below).

Daffodil Config File: This contains settings such as which warnings to suppress when compiling and/or at runtime, and tunables such as how large a regex match attempt to allow, the maximum parsed data size limit, etc. It is needed both at schema compile time and at runtime, since the same file contains parameters for both.
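Here is the reload side promised above, again as a minimal sketch with the Daffodil Java API and placeholder paths: reload the binary once per JVM, then share the read-only DataProcessor across parse calls. The XML outputter is only for illustration; the Drill plugin would instead use an outputter that writes the infoset into Drill's row set machinery.

```java
import java.io.File;
import java.io.FileInputStream;

import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.Diagnostic;
import org.apache.daffodil.japi.ParseResult;
import org.apache.daffodil.japi.infoset.XMLTextInfosetOutputter;
import org.apache.daffodil.japi.io.InputSourceDataInputStream;

public class ReloadAndParse {

  // Reloaded once per JVM and cached; the reloaded DataProcessor is read-only,
  // so it can safely be shared by parse calls on many threads.
  private static volatile DataProcessor cachedDp;

  static synchronized DataProcessor dataProcessor() throws Exception {
    if (cachedDp == null) {
      Compiler compiler = Daffodil.compiler();
      cachedDp = compiler.reload(new File("/tmp/mySchema.dfdl.bin")); // placeholder path
    }
    return cachedDp;
  }

  public static void parseOne(String dataFile) throws Exception {
    DataProcessor dp = dataProcessor();
    try (FileInputStream in = new FileInputStream(dataFile)) {
      InputSourceDataInputStream dis = new InputSourceDataInputStream(in);
      // XML text output just for illustration.
      XMLTextInfosetOutputter out = new XMLTextInfosetOutputter(System.out, true);
      ParseResult res = dp.parse(dis, out);
      if (res.isError()) {
        for (Diagnostic d : res.getDiagnostics()) {
          System.err.println(d.getMessage());
        }
      }
    }
  }
}
```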