cgivre commented on PR #2989:
URL: https://github.com/apache/drill/pull/2989#issuecomment-3824439816

   Hi Mike, 
   How’s it going?  I hope all is well.  It’s been a while since we spoke and 
I’d really like to wrap up the Drill / Daffodil work and get it merged as I’d 
like to cut a new release in the next month or so.  
   
   As I understand it, Daffodil needs several files in order to understand a 
schema.  For Drill’s purposes, the type of file isn’t really important, but all 
of these files need to be on every Drill node and on the classpath.  Ideally, 
we need a way for a user to upload these files so that they are distributed to 
all the nodes in the Drill cluster.
   
   My thinking here was that Drill UDFs do basically the same thing with the 
CREATE FUNCTION USING JAR… syntax.  When a user executes a query like this, the 
JAR file is moved from a staging folder on a single Drillbit to the appropriate 
location on all the Drillbits in the cluster.  I was thinking that we do the 
same thing for Daffodil.  In this case, the syntax might be wonky, but if 
Daffodil requires a bin file, all the user would have to do is upload it to the 
staging directory on a single node and execute a query:
   
   CREATE DAFFODIL SCHEMA USING JAR 'my-file.bin';
   If there are additional files, the user could do the same:
   CREATE DAFFODIL SCHEMA USING JAR 'my-other-file.jar';
   In the current implementation, Drill doesn’t check the file types, so all 
that is happening here is that the user is uploading a file and Drill is 
distributing it to the cluster.  So whether a Daffodil user creates a compiled 
BIN or uses a collection of JAR files, they can use the same mechanism to get 
them onto the Drill cluster.  We could also modify the query to allow:
   CREATE DAFFODIL SCHEMA USING BIN 'my bin.bin';
   
   But in any event, all this is doing is uploading a file and distributing it 
to the cluster.  Would this work for you?
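   To make the idea concrete, here is a rough Java sketch of the behavior 
described above: one staged file is copied to every node without any check on 
its type.  This is only an illustration, not Drill's actual code.  The names 
SchemaFileDistributor and distribute are made up, and local directories stand 
in for the Drillbits.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is not Drill's registration code.
final class SchemaFileDistributor {

    // Copies one staged file into every node directory without inspecting
    // its type, so a .bin and a .jar are handled identically.
    // Returns the paths of the copies that were written.
    static List<Path> distribute(Path stagedFile, List<Path> nodeDirs) throws IOException {
        List<Path> copies = new ArrayList<>();
        for (Path dir : nodeDirs) {
            Files.createDirectories(dir);
            Path target = dir.resolve(stagedFile.getFileName());
            Files.copy(stagedFile, target, StandardCopyOption.REPLACE_EXISTING);
            copies.add(target);
        }
        return copies;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the staging area and the nodes.
        Path root = Files.createTempDirectory("drill-sketch");
        Path staged = Files.write(root.resolve("my-file.bin"), new byte[] {1, 2, 3});
        List<Path> nodes = List.of(root.resolve("node1"), root.resolve("node2"));
        for (Path copy : distribute(staged, nodes)) {
            System.out.println("copied to " + copy);
        }
    }
}
```

   Note that the sketch only covers distribution.  Getting the copied JAR 
files onto the classpath is a separate step that it does not attempt.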
   Thanks!
   — C
   
   
   
   > On Nov 5, 2025, at 16:41, Mike Beckerle ***@***.***> wrote:
   > 
   > 
   > mbeckerle left a comment (apache/drill#2989) 
<https://github.com/apache/drill/pull/2989#issuecomment-3493623186>
   > OK, if I specify an actual jar file containing some compiled Java code, 
will that be put onto the Java classpath in the Drillbits?
   > 
   > The issue I'm seeing is that schemas are normally pre-compiled into a 
".bin" file which is fast to load, but in addition to this file, the schema may 
have a dependency on certain Daffodil plugin code, which is compiled Java in 
jar files. This dependency can be on multiple different jar files. All these 
dependency jar files need to be on the classpath.
   > 
   > The Daffodil plugins are of 3 kinds: UDFs, "layers" (which compute 
checksums, decompress zip files, etc.), and charset definitions. All are 
dynamically loaded into the JVM when the DFDL schema requests them. They are 
found using the
   > 
   > All these different jar files need to be on the Java classpath so that 
their metadata allows dynamic loading.
   > 
   > So while a simple DFDL schema might be contained in one jar file, in 
general there can be a dependency on multiple jar files which must be placed 
onto the Java classpath in a specific order. The schema may be needed in source 
form also for validation of data.
   > 
   > As a case in point, on GitHub there are DFDL schema projects named:
   > 
   > envelope-payload
   > tcpMessage
   > mil-std-2045
   > PCAP
   > ethernetIP
   > These are separate component DFDL schemas that are assembled to form an 
assembly schema by way of schema composition.
   > The only jar file that needs to be on the classpath is the one from 
ethernetIP, since that defines a layer algorithm for computing IPv4 checksums.
   > 
   > The DFDL schema that combines all these components can be pre-compiled 
into an envelope-payload.bin file.
   > 
   > So in this case I need this ".bin" file to be distributed across the 
cluster and loaded by Daffodil in each Drillbit. The ethernetIP.jar file also 
needs to be distributed across the Drill cluster and placed on the classpath of 
each Drillbit's Java process.
   > 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

