Re: Proposed Way Forward for Drill <> DFDL Integration

Charles Givre Mon, 07 Oct 2024 07:35:55 -0700

Thanks!

Looking forward.



> On Oct 7, 2024, at 10:31, Mike Beckerle <[email protected]> wrote:
> 
> Ok, to get what I've done so far merged, I have to rebase it on the latest
> Drill commit, and the JUnit tests that exercise it must pass.
> 
> I also need to update to Daffodil 3.9.0, which was just released.
> 
> This *should* be very easy; the unit tests were all working the last time
> I tried them.
> 
> I will try to get this done this week.
> 
> 
> On Mon, Oct 7, 2024 at 9:58 AM Charles Givre <[email protected]> wrote:
> 
>> Hi Mike,
>> Let me answer this as best I can.  Firstly, just to be clear on this
>> point, the phase 1 implementation isn’t the desired state.  It’s not really
>> all that workable, but it gets what you’ve already done merged.   Since, as
>> you mentioned, DFDL needs multiple files, what if you were to put these
>> files in a folder on the classpath?  For example:
>> 
>> Classpath/schema1/
>> Classpath/schema2/
>> 
>> For tests, I’d imagine all you have to do is copy the valid files into the
>> test/resources/ folder and then run your queries.  In real-life situations a
>> user would have to copy all the files onto the classpath of every Drill
>> node.  This will be dealt with in phase 2, where the user will simply have
>> to copy the files into a staging directory and Drill will handle copying
>> them to all nodes.  (I think)
>> 
>> Best,
>> — C
>> 
>> 
>>> On Oct 3, 2024, at 10:15, Mike Beckerle <[email protected]> wrote:
>>> 
>>> I agree we can do the phase1 merge. It should not break anything.
>>> 
>>> Phase 2 ... Paul suggested "just throw everything into
>>> $DRILL_CONFIG_DIR", plugin jars, schema jars, everything, as
>>> apparently that gets automatically copied everywhere and put on the
>>> class path.
>>> 
>>> I left off right at that point for lack of knowledge.
>>> 
>>> How would a test work that way? I.e., a Maven test under
>>> src/test/java... how is it going to arrange for DRILL_CONFIG_DIR to be
>>> defined, and put things into that directory before Drill executes (and
>>> reads the env for DRILL_CONFIG_DIR's value)? I normally think of
>>> env-vars as frozen at the time the JVM starts, so tests can't change
>>> them unless they fork a process, and in a complex system like
>>> Drill I have no idea what the implications of that are.
>>> 
>>> The only logic change needed, I think, is to deal with "there is exactly
>>> 1 file to parse and query" vs. "there are numerous files to parse and
>>> query". These files could, I suppose, be distributed somehow, but they
>>> also could just be a bunch of files. My guess is Drill already has all
>>> of this, and we just have to reuse the pattern from some other
>>> extension.
>>> 
>>> 
>>> On Wed, Oct 2, 2024 at 9:17 AM Charles Givre <[email protected]> wrote:
>>>> 
>>>> Hi Mike,
>>>> I hope all is well.  I need to apologize as I grossly overestimated my
>>>> available free time to assist with the DFDL / Drill integration.  I had a
>>>> thought which I wanted to propose.
>>>> 
>>>> My thinking is that we should complete the integration in two phases:
>>>> 
>>>> Phase 1:
>>>> For phase 1, I propose that we merge the work that you’ve already
>>>> done.  We’d have to make sure that the DFDL files are accessible from the
>>>> class path.  This isn’t really a great solution, but it is just to get the
>>>> pieces in so we can work on phase 2.  I don’t like seeing good work
>>>> languishing in the PR queue and getting stale.  To complete phase 1, all
>>>> we’d really have to do is get the unit tests working.
>>>> 
>>>> Phase 2:
>>>> The remaining issue revolves around making the DFDL files accessible to
>>>> Drill and also so that a user can easily add or remove files.  For this we
>>>> have a solution: DRILL-4726[1] which provides dynamic UDF support.
>>>> Basically what I’m proposing is that we duplicate the components of this PR
>>>> for Drill.  The end result would be that a user could copy the UDF files to
>>>> a staging directory.  Then the user would run a command like:
>>>> 
>>>> CREATE DAFFODIL SCHEMA xxxx USING JAR yyyyy
>>>> 
>>>> When the user does that, the file would be propagated to all the Drill
>>>> nodes.  Implementing this feature would really involve a lot of duplicating
>>>> with slight mods from that pull request.  What do you think?
>>>> Best,
>>>> — C
>>>> 
>>>> 
>>>> 
>>>> [1]: https://github.com/apache/drill/pull/574
>>>> 
>>>> 
>>>> 
>> 
>>
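
On the phase 1 idea of putting each schema's files in its own folder on the classpath (schema1/, schema2/): the lookup could be sketched roughly as below. This is only an illustration, not code from the PR. The class name, the helper names, and the file names main.dfdl.xsd and types.dfdl.xsd are hypothetical, and a temporary directory stands in for test/resources/ so the example runs on its own.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class SchemaOnClasspathSketch {

    /** Looks up one schema file inside a per-schema folder on the classpath; null if absent. */
    static URL findSchema(ClassLoader loader, String schemaDir, String fileName) {
        // Classpath resource names always use '/' as the separator.
        return loader.getResource(schemaDir + "/" + fileName);
    }

    /** Builds a throwaway classpath root with a schema1/ folder and resolves files from it. */
    static boolean demo() throws IOException {
        // Assumption: this temp directory plays the role of test/resources/.
        Path root = Files.createTempDirectory("classpath-root");
        Path schema1 = Files.createDirectories(root.resolve("schema1"));
        // Hypothetical file names; a real DFDL schema is usually several files.
        Files.writeString(schema1.resolve("main.dfdl.xsd"), "<!-- placeholder -->");
        Files.writeString(schema1.resolve("types.dfdl.xsd"), "<!-- placeholder -->");

        try (URLClassLoader loader =
                 new URLClassLoader(new URL[] { root.toUri().toURL() })) {
            boolean mainFound = findSchema(loader, "schema1", "main.dfdl.xsd") != null;
            boolean typesFound = findSchema(loader, "schema1", "types.dfdl.xsd") != null;
            // schema2/ was never created, so this lookup is expected to miss.
            boolean schema2Missing = findSchema(loader, "schema2", "main.dfdl.xsd") == null;
            return mainFound && typesFound && schema2Missing;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("lookups behaved as expected: " + demo());
    }
}
```

Because the whole folder is on the classpath, the sibling files of a multi-file schema sit next to each other and can be resolved the same way.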

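On Mike's question about DRILL_CONFIG_DIR in tests: the point that environment variables are fixed once the JVM has started can be checked with a small sketch like the one below. It is a generic Java illustration and says nothing about how Drill's own test harness behaves. The class name, the method names, and the /tmp/drill-conf path are hypothetical.

```java
public class EnvSketch {

    /** Returns true if the running JVM refuses changes to its own environment map. */
    static boolean currentEnvIsReadOnly() {
        try {
            // System.getenv() is documented to return an unmodifiable map.
            System.getenv().put("DRILL_CONFIG_DIR", "/tmp/ignored");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    /** Prepares (but does not start) a child process that would see DRILL_CONFIG_DIR. */
    static ProcessBuilder childWithConfigDir(String configDir, String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // The child's environment starts as a copy of ours and can be edited freely.
        pb.environment().put("DRILL_CONFIG_DIR", configDir);
        return pb;
    }

    public static void main(String[] args) {
        System.out.println("current env read-only: " + currentEnvIsReadOnly());
        // Hypothetical directory and command, for illustration only.
        ProcessBuilder pb = childWithConfigDir("/tmp/drill-conf", "java", "-version");
        System.out.println("child would see: " + pb.environment().get("DRILL_CONFIG_DIR"));
    }
}
```

So a test has two general options: start a child process with the variable set, as above, or have the build set it for the forked test JVM (Maven Surefire has an environmentVariables setting for this). Whether either fits Drill's test setup is an open question here.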