Re: Update on DFDL/Drill

2024-07-08 Thread Mike Beckerle
Hi Charles,

The PR here: https://github.com/apache/drill/pull/2909 has all the
latest work, updated to Daffodil 3.8.0 and rebased on current master.

It fails tests because it still contains an absolute file URI that
will not be found on systems other than mine. Getting this to find the
files in the DRILL_CONFIG_DIR is part of the needed changes.

For data and DFDL schema to test on, I suggest just grabbing existing
unit test files in the test resources:

Under src/test/resources/data, the file moreTypes1.txt.dat
The corresponding schema is src/test/resources/schema/moreTypes1.dfdl.xsd.

If those 2 files can be placed in DRILL_CONFIG_DIR/lib or another magic
location, so that a query like:

SELECT * FROM
  table(dfs.[[[ somehow refer to the data moreTypes1.txt.dat]]]
  (type => 'daffodil',
  validationMode => 'true',
   schemaURI => [[[somehow refer to the schema moreTypes1.dfdl.xsd]]] ,
   rootName => 'row',
   rootNamespace => null ))

works, then that's what we need.
This should return results corresponding to the current unit test
testMoreTypes1(): just two rows, but they illustrate most data types
working.

On Sun, Jul 7, 2024 at 11:27 AM Charles Givre  wrote:
>
> Mike,
> Thanks for the response.  Would you mind please rebasing on current master, 
> and doing the DFDL update, then sending me sample test data?  I can see what 
> I can figure out with respect to where the files go.
>
> Best,
> — C
>
>
> On Jul 5, 2024, at 12:18, Mike Beckerle  wrote:
>
> k)
>
>


Re: Update on DFDL/Drill

2024-07-05 Thread Mike Beckerle
I am trying to get back to this and finish it to a usable point.

I need to rebase on the latest Drill, then update to Daffodil 3.8.0.

Finishing it is, I think, very little work. I just have no idea how to
do it, and need some help.

Paul Rogers's suggestion is to just put DFDL schemas, data, and
anything else needed (all dynamically loaded jars) into
$DRILL_CONFIG_DIR/lib as an initial version, because this location is
reachable from all places Drill would put the drill-bits and is on the
class path. This implies restarting Drill to add a new DFDL schema, but
that's ok for a first version.

I am fine with that; I just don't understand what it means for how the
code must change to access files in this location versus the way it
does now.
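For instance, my best guess at the shape of the change — a sketch only,
assuming the schema file itself ends up on the class path via
$DRILL_CONFIG_DIR/lib; the class and resource names are hypothetical:

import java.io.FileNotFoundException;
import java.net.URI;
import java.net.URL;

public class SchemaLocatorSketch {
  // Resolve a DFDL schema as a classpath resource instead of an
  // absolute file URI, so any drill-bit that has DRILL_CONFIG_DIR/lib
  // on its class path can find it.
  public static URI locateSchema(String resourceName)
      throws FileNotFoundException, java.net.URISyntaxException {
    URL url = Thread.currentThread().getContextClassLoader()
        .getResource(resourceName);
    if (url == null) {
      throw new FileNotFoundException(resourceName + " not found on class path");
    }
    return url.toURI();
  }
}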

See this PR comment:
https://github.com/apache/drill/pull/2909#discussion_r1666968774

I also don't understand the implications of this $DRILL_CONFIG_DIR/lib
usage if there is parallelism. E.g., will a query be issued in parallel,
with each drill-bit running Daffodil and opening the same DFDL schema
(that's ok) and then opening the same data file (not ok)?


Mike Beckerle
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Apache Daffodil PMC | daffodil.apache.org
Owl Cyber Defense | www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions
are subject to the OGF Intellectual Property Policy


On Wed, Jul 3, 2024 at 11:40 AM Charles Givre  wrote:
>
> Hi Mike,
> I hope all is well.  I wanted to check in with you to see how things are
> going with the Drill/DFDL integration. Are we close to being able to merge?
> Best,
> — C
>


Re: Drill <> DFDL

2024-04-17 Thread Mike Beckerle
Yes, I hope to retest all of the Drill-Daffodil contrib with the 3.7.0
release "real soon now", at which point it could be merged, even in its
current form, without breaking anything in Drill.

Then work to finish it could proceed and hopefully will not require further
Daffodil changes.

On Wed, Apr 10, 2024 at 9:48 AM Charles Givre  wrote:

> Hi Mike,
> I hope all is well.  Congrats on the latest release of Daffodil!  Now that
> is done, I wanted to follow up with you about the Drill/DFDL integration.
> Are we getting closer to being able to merge this?
> Best,
> -- C


DFDL Standard approved for ISO JTC1 PAS (Publicly Available Specification)

2024-01-22 Thread Mike Beckerle
I received notice today that the DFDL OGF standard is officially headed to
become an ISO standard. The ballot within ISO JTC1 passed 100%.

I will share more information when I get it about how we can publicize
this, any branding considerations ISO JTC1 requires, their trademark
rules, etc.
At that point it will make sense to send this info to our users lists,
make a blog post, etc.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


which issue tickets, GitHub or JIRA?

2024-01-09 Thread Mike Beckerle
https://github.com/apache/drill/issues (has 52 Issues)

vs.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20DRILL%20AND%20resolution%20%3D%20Unresolved
(has 2673 issues)



Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


[jira] [Created] (DRILL-8476) Float type displays like Double

2024-01-09 Thread Mike Beckerle (Jira)
Mike Beckerle created DRILL-8476:


 Summary: Float type displays like Double
 Key: DRILL-8476
 URL: https://issues.apache.org/jira/browse/DRILL-8476
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.20.3
Reporter: Mike Beckerle


These are the results of a test where Daffodil hands a float with
value Float.MaxValue to Drill.

Notice how Drill displays this value as if it had double precision.
```
 org.junit.ComparisonFailure:
 Expected :{... , 3.4028235E38,          ... }
 Actual   :{... , 3.4028234663852886E38, ... }
```

This is a bug we found and fixed in Apache Daffodil also. See
https://github.com/apache/daffodil/pull/1133 for context and discussion.

I left the component blank initially as I'm not sure what part of Drill
is responsible for this.
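The effect is reproducible in plain Java — a minimal sketch (not Drill
code) of the float-widening behavior:

public class FloatDisplaySketch {
  public static void main(String[] args) {
    float f = Float.MAX_VALUE;
    // Formatting the value as a float: the shortest digits that round-trip.
    System.out.println(Float.toString(f));   // prints 3.4028235E38
    // Implicitly widening to double first exposes the full double
    // expansion of the same value, which is what Drill appears to show.
    System.out.println(Double.toString(f));  // prints 3.4028234663852886E38
  }
}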



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


assistance needed debugging drill + daffodil

2023-12-07 Thread Mike Beckerle
I am blocked on getting a test (testComplexQuery3) to work. Its row
contains a couple of int columns plus a map column, where that map
contains 2 additional int fields.
Rows containing only simple integer fields work. The next step is to let
a column of the top-level row be a map holding a pair of additional
fields, and that's what is failing.

The test fails in the assert here:

@Override
public void endArrayValue() {
  assert state == State.IN_ROW;  // FAILS HERE WITH State.IDLE
  for (AbstractObjectWriter writer : writers) {
    writer.events().endArrayValue();
  }
}

(That is at line 306 of AbstractTupleWriter.java)

This recursively calls endArrayValue on the child writers, and the
state of the first of these is IDLE, not IN_ROW, so it fails the
assert.

This must mean I am doing something wrong with the setup/creation of
the metadata for the map column (line 193 of
DrillDaffodilSchemaVisitor.java) ...

and/or creating and populating the data for this map column (line 177
of DaffodilDrillInfosetOutputter.java).
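For reference, here is the EVF calling pattern I believe is expected for
a row containing a map column — a hedged sketch, not the plugin's actual
code; the RowSetLoader/TupleWriter names and package paths are my best
understanding of Drill's EVF API:

import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.vector.accessor.TupleWriter;

public class MapColumnSketch {
  // One row shaped like {a1, a2, b: {b1, b2}}. Note that start()/save()
  // bracket the row on the row writer only; the nested map writer has
  // no start/save of its own.
  static void writeOneRow(RowSetLoader rowWriter) {
    rowWriter.start();
    rowWriter.scalar("a1").setInt(1);
    rowWriter.scalar("a2").setInt(2);
    TupleWriter b = rowWriter.tuple("b");  // the map column
    b.scalar("b1").setInt(3);
    b.scalar("b2").setInt(4);
    rowWriter.save();
  }
}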

Any insights would be helpful.

The PR is here: https://github.com/apache/drill/pull/2836

My fork is here: https://github.com/mbeckerle/drill/tree/daffodil-2835
(that's branch daffodil-2835)

Note this fork works with the current 3.7.0-SNAPSHOT version of Apache
Daffodil, but the features it needs in Daffodil are not yet in an
"official" release.

On Linux, running 'sbt publishM2' in the Daffodil directory before
rebuilding Drill should do it, once you have everything needed to build
Daffodil installed (see BUILD.md in Daffodil).

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Re: status of daffodil + drill work

2023-11-14 Thread Mike Beckerle
daffodil with git hash d26a582b62c5e26b4bcac895b9cb8960c3ce8522
(2023-11-14) or newer supports the metadata and data bridges used by the
current drill integration work I have been doing.

I no longer have a special fork of the daffodil repo.

On Fri, Nov 10, 2023 at 7:00 PM Mike Beckerle  wrote:

>
> I have saved my work at this checkpoint while debugging
>
> I have junit tests working that show I can create Drill metadata and parse
> data via Drill SQL from DFDL schemas that describe what turn into Drill
> flat row-sets, with all columns being simple types (only INT at the
> moment). These work fine.
>
> The next thing to add is a column that is a map. First baby step in nested
> substructure.
>
> The test for this is testComplexQuery3
>
> This test introduces a column that is not simple/INT, it is a map.
>
> So the row now looks like {a1, a2, b: {b1, b2}} where 'b' is the map, and a1, 
> a2, b1, b2 are int type.
>
> This fails.
>
> The new 'b' map column is causing a failure when the DaffodilBatchReader 
> invokes rowSetLoader.save() to close out the row.
>
> It seems to populate the row with a1, a2, b1, and b2, and endWrite on the map 
> is called and that all works.
>
> It fails at an 'assert state == State.IN_ROW', at line 308 of 
> AbstractTupleWriter.java.
>
> So something about having added this column (which is a map), to the row, is 
> causing the state to be incorrect.
>
> If you look at my Drill PR (https://github.com/apache/drill/pull/2836) you 
> can search for FIXME.
>
> My fork repo: https://github.com/mbeckerle/drill, branch daffodil-2835.
>
> My next step is to go back to daffodil, and get all the changes I have needed 
> there integrated in and pushed to the main branch.
>
> That way at least others will have an easier time running this Drill branch 
> of mine to see what is going wrong.
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>
>
>


status of daffodil + drill work

2023-11-10 Thread Mike Beckerle
I have saved my work at this checkpoint while debugging.

I have JUnit tests working that show I can create Drill metadata and
parse data via Drill SQL from DFDL schemas describing data that turns
into flat Drill row-sets, with all columns being simple types (only INT
at the moment). These work fine.

The next thing to add is a column that is a map. First baby step in nested
substructure.

The test for this is testComplexQuery3

This test introduces a column that is not simple/INT, it is a map.

So the row now looks like {a1, a2, b: {b1, b2}} where 'b' is the map,
and a1, a2, b1, b2 are int type.

This fails.

The new 'b' map column is causing a failure when the
DaffodilBatchReader invokes rowSetLoader.save() to close out the row.

It seems to populate the row with a1, a2, b1, and b2, and endWrite on
the map is called and that all works.

It fails at an 'assert state == State.IN_ROW', at line 308 of
AbstractTupleWriter.java.

So something about having added this map column to the row is causing
the state to be incorrect.

If you look at my Drill PR (https://github.com/apache/drill/pull/2836)
you can search for FIXME.

My fork repo: https://github.com/mbeckerle/drill, branch daffodil-2835.

My next step is to go back to daffodil, and get all the changes I have
needed there integrated in and pushed to the main branch.

That way at least others will have an easier time running this Drill
branch of mine to see what is going wrong.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Assistance needed debugging Drill + Daffodil

2023-11-07 Thread Mike Beckerle
I have had some success on Drill + Daffodil integration, but I do need
assistance now.

I am at the point where I am clearly not using the API correctly to
populate arrays, and not releasing resources that must be managed.

If you get my fork+branch of Daffodil from here:
https://github.com/mbeckerle/daffodil/releases/tag/drill-exp-2023-11-07
(drill-exp2 branch), you can unzip the binary zip file into the
~/.m2/repository/org/apache/daffodil directory, which will enable my
fork+branch of Drill to compile and run.

My Drill fork and branch (daffodil-2835) are here:
https://github.com/mbeckerle/drill/tree/daffodil-2835

A PR with the current state of the code is here:
https://github.com/apache/drill/pull/2836

The tests that are failing are named testComplexQuery1 and
testComplexArrayQuery1. They are in
contrib/format-daffodil/src/test/java/org/apache/drill/exec/store/daffodil/TestDaffodilReader.java

The "action" as it were is mostly in the files:

(1)
contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilDrillInfosetOutputter.java
which accepts Daffodil SAX-style parse output events, and populates Drill
rows, columns, and arrays.
This is a stateful event handler that maintains a stack of the current
Drill TupleWriter/ArrayWriter. A breakpoint in each of its handler methods
easily shows what is happening on these small tests.

and

(2)
contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java
which implements the setup of the batch reader and the next() routine that
invokes the Daffodil parse() call (which in turn invokes the above
DaffodilDrillInfosetOutputter). This DaffodilBatchReader.java file is likely
responsible for not setting up resources to be released properly.

The testComplexQuery1 test uses a DFDL schema for a complex type containing
two integers.  It works, in that the Drill query result contains the two
values.

If this were creating JSON, the output data would look like { "ex_r":{
"a1":"257", "a2":"258"} }

However, after returning that result, and after the JUnit test itself
indicates passing, I get the errors in the attached
drill-testComplexQuery1-output.txt, which seem to be related to not
releasing resources properly.

The testComplexArrayQuery1 test uses a similar schema, but it has a
repeating element (an array) named 'record', which contains two integers
in each array element. The test data has 3 such records.

If this was creating JSON, the output data would look like

{ "ex_r":{ "record":[
 {"a1":"257", "a2":"258"},
 {"a1":"259", "a2":"260"},
 {"a1":"261", "a2":"262"}
 ] } }

This works in that, in the debugger, I can see it walk all 3 test records,
parse all the data, and make calls to populate data in Drill. But it seems
I am missing something important about how the arrays are properly
created, populated, and closed: it gives an array-related error message,
and never gets to where a Drill result is returned.  See the attached
drill-testComplexArrayQuery1-output.txt.
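For reference, my understanding of the EVF pattern for a repeated map —
a hedged sketch, not the plugin code, assuming Drill's
TupleWriter/ArrayWriter API as I understand it:

import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.vector.accessor.ArrayWriter;
import org.apache.drill.exec.vector.accessor.TupleWriter;

public class RepeatedMapSketch {
  // One row shaped like { ex_r: { record: [ {a1, a2}, ... ] } }.
  // ArrayWriter.save() closes out each array element; only
  // RowSetLoader.save() closes out the whole row.
  static void writeOneRow(RowSetLoader rowWriter) {
    rowWriter.start();
    TupleWriter exR = rowWriter.tuple("ex_r");
    ArrayWriter records = exR.array("record");
    TupleWriter record = records.tuple();
    for (int i = 0; i < 3; i++) {
      record.scalar("a1").setInt(257 + 2 * i);
      record.scalar("a2").setInt(258 + 2 * i);
      records.save();  // finish this array element
    }
    rowWriter.save();
  }
}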

Any help is greatly appreciated.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
#: `ex_r` STRUCT<`a1` INT NOT NULL, `a2` INT NOT NULL>
0: {257, 258}


java.lang.RuntimeException: Exception while closing
  at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46)
  at org.apache.drill.exec.client.DrillClient.close(DrillClient.java:481)
  at org.apache.drill.test.ClientFixture.close(ClientFixture.java:259)
  at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:91)
  at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:71)
  at org.apache.drill.test.ClusterTest.shutdown(ClusterTest.java:88)
Caused by: java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding buffers allocated (1).
Allocator(ROOT) 0/128/4352/2684354560 (res/actual/peak/limit)
  child allocators: 0
  ledgers: 1
    ledger[68] allocator: ROOT), isOwning: true, size: 128, references: 2, life: 315432066733998..0, allocatorManager: [62, life: 315432066691268..0] holds 8 buffers.
      DrillBuf[94], udle: [63 100..104]
      DrillBuf[82], udle: [63 0..128]
      DrillBuf[83], udle: [63 10..94]
      DrillBuf[92], udle: [63 96..100]
      DrillBuf[93], udle: [63 100..104]
      DrillBuf[84], udle: [63 96..104]
      DrillBuf[91], udle: [63 96..100]
      DrillBuf[90], udle: [63 96..104]
  reservations: 0
  at org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:502)
  at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44)
  ... 5 more

Found one or m

Re: Config Questions

2023-10-31 Thread Mike Beckerle
Sure, I'd be happy to chat on MS Teams (preferred, since I have it
installed, mbecke...@owlcyberdefense.com) or Google Meet
(mbeckerle.d...@gmail.com).

I don't have Zoom installed, but I think one can use it from a browser, maybe?

I am free now until 4pm today; Wednesday 9am to 3pm is open.

My mobile work number is 781-330-0412.


On Tue, Oct 31, 2023 at 12:47 PM Charles Givre  wrote:

> Hi Mike,
> I had a look at your branch, but I couldn't build it because my machine
> couldn't find your version of Daffodil.  I don't want to push to your
> branch directly w/o permission again, but I looked at your unit tests and
> have some suggestions that might clarify things.  Would you be open to a
> brief zoom call sometime?  I think it might be quicker if we took 15 min
> and I can explain all this vs the back and forth over email.
>
> If not.. I'll write it up.
> Best,
> -- C
>
>


Re: Config Questions

2023-10-31 Thread Mike Beckerle
You can push to my fork. Just put your stuff on a branch so I can
isolate/merge it if I have other local changes going on simultaneously.

The Daffodil fork I have is https://github.com/mbeckerle/daffodil and my
branch is "drill-exp".

I made a sort of "release" of it here:
https://github.com/mbeckerle/daffodil/releases/tag/3.7.0-SNAPSHOT-drill-exp-2023-10-30.
The zip file contains the ~/.m2/repository/org/apache/daffodil directory;
unzipped there, it should let you link against Daffodil.


On Tue, Oct 31, 2023 at 12:47 PM Charles Givre  wrote:

> Hi Mike,
> I had a look at your branch, but I couldn't build it because my machine
> couldn't find your version of Daffodil.  I don't want to push to your
> branch directly w/o permission again, but I looked at your unit tests and
> have some suggestions that might clarify things.  Would you be open to a
> brief zoom call sometime?  I think it might be quicker if we took 15 min
> and I can explain all this vs the back and forth over email.
>
> If not.. I'll write it up.
> Best,
> -- C
>
>


Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Mike Beckerle
I am very much hoping someone will look at my open PR soon:
https://github.com/apache/drill/pull/2836

I am basically blocked on this effort until I get help with one key area
of it.

I expect the part I am puzzling over is routine to you, so it will save me
much effort.

This is the key area in the DaffodilBatchReader.java code:

  // FIXME: Next, a MIRACLE occurs.
  //
  // We get the dfdlSchemaURI filled in from the query, or a default config location.
  // We get the rootName (or null if not supplied) from the query, or a default config location.
  // We get the rootNamespace (or null if not supplied) from the query, or a default config location.
  // We get the validationMode (true/false) filled in from the query, or a default config location.
  // We get the dataInputURI filled in from the query, or from a default config location.
  //
  // For a first cut, let's just fake it. :-)
  boolean validationMode = true;
  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
  String rootName = null;
  String rootNamespace = null;
  URI dataInputURI = new URI("data/complexArray1.dat");


I imagine this is just a few lines of code to grab these from the query,
and I don't even care about config files for now.

I gave up on trying to figure out how to do this myself. It was actually
quite unclear from looking at the other format plugins. The way Drill does
configuration is obviously motivated by the distributed architecture
combined with pluggability, but all that, combined with the negotiation
over schemas which extends into runtime, became quite muddy to me. I
think what I need is super straightforward, so I figured I should just
ask.
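For concreteness, here is my understanding of the general pattern by
which EVF format plugins receive such parameters: Jackson deserializes
them into the format plugin config (which table() parameters override),
and the batch reader reads them from that config. The PR already has a
DaffodilFormatConfig; the sketch below is not that file — the field
names mirror the query parameters in this thread, everything else is my
guess:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

@JsonTypeName("daffodil")
public class DaffodilFormatConfigSketch implements FormatPluginConfig {
  public final String schemaURI;
  public final String rootName;
  public final String rootNamespace;
  public final boolean validationMode;

  @JsonCreator
  public DaffodilFormatConfigSketch(
      @JsonProperty("schemaURI") String schemaURI,
      @JsonProperty("rootName") String rootName,
      @JsonProperty("rootNamespace") String rootNamespace,
      @JsonProperty("validationMode") boolean validationMode) {
    this.schemaURI = schemaURI;
    this.rootName = rootName;
    this.rootNamespace = rootNamespace;
    this.validationMode = validationMode;
  }
  // A real config also needs equals()/hashCode() so Drill can compare
  // plugin configs.
}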

This is just to get enough working (against local files only) that I can be
unblocked on creating and testing the rest of the Daffodil-to-Drill
metadata bridge and data bridge.

My plan is to get all kinds of data and queries working first but just
against local-only files.  Fixing it to work in distributed Drill can come
later.

-mikeb

On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers  wrote:

> Hi Charles,
>
> The persistent store is just ZooKeeper, and ZK is known to work poorly as
> a distributed DB. ZK works great for things like tokens, node registrations
> and the like. But, ZK scales very poorly for things like schemas (or query
> profiles or a list of active queries.)
>
> A more scalable approach may be to cache the schemas in each Drillbit,
> then translate them to Drill's format and include them in each Scan
> operator definition sent to each execution Drillbit. That solution avoids
> race conditions when the schemas change while a query is in flight. This
> is, in fact, the model used for storage plugin definitions. (The storage
> plugin definitions are, in fact, stored in ZK, but tend to be small and few
> in number.)
>
> - Paul
>
>
> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre  wrote:
>
>> Hi Mike,
>> I hope all is well.  I remembered one other piece which might be useful
>> for you.  Drill has an interface called a PersistentStore which is used for
>> storing artifacts such as tokens etc.  I've uesd it on two occasions: in
>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
>> store OAuth user tokens which need to be preserved and shared across
>> drillbits, and also frequently updated.  I was thinking that this might be
>> useful for caching the DFDL schemata.  If you take a look here:
>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>>
>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
>> and here
>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
>> you can see how I used that.
>>
>> Best,
>> -- C
>>
>>
>>
>>
>>
>>
>> > On Oct 13, 2023, at 1:25 PM, Mike Beckerle 
>> wrote:
>> >
>> > Very helpful.
>> >
>> > Answers to your questions, and comments are below:
>> >
>> > On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgi...@gmail.com> wrote:
>> >> HI Mike,
>> >> I hope all is well.  I'll take a stab at answering your questions.
>> But I have a few questions as well:
>> >>
>> >> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
>> was that this would be a format plugin, but let me know if you were
>> thinking differently
>> >
>> > Format plugin.
>> >
>> >> 2.  In traditional deployments, where do people store the DFDL
>> schemata files?  Are they local or ac

Fwd: [apache/drill] WIP: Preliminary Review on adding Daffodil to Drill (PR #2836)

2023-10-13 Thread Mike Beckerle
My PR needs input from drill developers.

Please look for TODO and FIXME in this PR and help me get to where I can
initialize this plugin.

In general I copied things from format-xml contrib, but then took ideas
from Json. I was unable to figure out how initialization works from the
Excel plugin.

The metadata bridge is here, and a stub of the data bridge - handles only
simple type "INT" right now, and of course doesn't compile yet.

https://github.com/apache/drill/pull/2836


---------- Forwarded message ---------
From: Mike Beckerle 
Date: Fri, Oct 13, 2023 at 11:11 PM
Subject: [apache/drill] WIP: Preliminary Review on adding Daffodil to Drill
(PR #2836)
To: apache/drill 
Cc: Mike Beckerle , Your activity <your_activ...@noreply.github.com>


DRILL-2835 <https://issues.apache.org/jira/browse/DRILL-2835>: Preliminary
Review on adding Daffodil to Drill

Description

New format-daffodil module created. But I need assistance with several
aspects.

Tests of creating Drill schemas from DFDL are working. They're simple, but
they show promise.

There are major TODO/FIXME/TBDs in here. Search for FIXME, and "Then a
MIRACLE occurs..."

This does not compile yet because of the plugin system and how to
initialize things. This is the main open problem: getting it to compile
without error.

Needs review by Drill-devs.

Documentation

TBD: This will require doc eventually.

Testing

Needs more. This is just a preliminary design review; work-in-progress.
--
You can view, comment on, or merge this pull request online at:

  https://github.com/apache/drill/pull/2836
Commit Summary

   - 0633fdb
   
<https://github.com/apache/drill/pull/2836/commits/0633fdbe61bc073cdc1b4fb81551829ff83ecc6c>
   Checkpoint on adding Daffodil to Drill

File Changes

(25 files <https://github.com/apache/drill/pull/2836/files>)

   - *A* contrib/format-daffodil/.gitignore (2)
   - *A* contrib/format-daffodil/README.md (41)
   - *A* contrib/format-daffodil/pom.xml (89)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java (180)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilDrillInfosetOutputter.java (105)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilFormatConfig.java (97)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilFormatPlugin.java (87)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilMessageParser.java (187)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DaffodilDataProcessorFactory.java (130)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DrillDaffodilSchemaUtils.java (107)
   - *A* contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/schema/DrillDaffodilSchemaVisitor.java (121)
   - *A* contrib/format-daffodil/src/main/resources/bootstrap-format-plugins.json (26)
   - *A* contrib/format-daffodil/src/main/resources/drill-module.conf (25)
   - *A* contrib/format-daffodil/src/test/j

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-13 Thread Mike Beckerle
e, the next thing would be to convert
> that into a Drill schema.  Let's say that we have a function called
> dfdlToDrill that handles the conversion.
>
> What you'd have to do is in the constructor for the BatchReader, you'd
> have to set the schema there.  So pseudo code:
>
> public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan,
>     FileSchemaNegotiator negotiator) {
>   // Other stuff...
>
>   // Get the Drill schema from DFDL
>   TupleMetadata schema = dfdlToDrill(...);
>   // Here's the important part:
>   negotiator.tableSchema(schema, true);
> }
>
> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
> boolean as to whether the schema is final or not.  Once this schema has
> been added to the negotiator object, you can then create the writers.
>
>
That negotiator.tableSchema() is ideal. I was hoping that this was going to
be the only place the metadata had to be given to drill. Excellent.


>
> Take a look here:
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>
>
> I see Paul just responded so I'll leave you with this.  If you have
> additional questions, send them our way.  Do take a look at the Excel
> plugin as I think it will be helpful.
>
Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
work similarly.

This will take me a few more days to get to a pull request. The first one
will be initial review, i.e., not intended to merge without more tests.
Probably it will support only integer data fields, but should support lots
of data shapes including vectors, choices, sequences, nested records, etc.

Thanks for the help.


>
> On Oct 12, 2023, at 2:58 PM, Mike Beckerle  wrote:
>
> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
>
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
>
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
>
> I just need pointers to the code that illustrate how to do this. Thanks
>
> -Mike Beckerle
>
>
>
>
>
>
>
>
>
>
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers  wrote:
>
> Mike,
>
> This is a complex question and has two answers.
>
> First, the standard enhanced vector framework (EVF) used by most readers
> assumes a "pull" model: read each record. This is where the next() comes
> in: readers just implement this to read the next record. But, the code
> under EVF works with a push model: the readers write to vectors, and signal
> the next record. EVF translates the lower-level push model to the
> higher-level, easier-to-use pull model. The best example of this is the
> JSON reader which uses Jackson to parse JSON and responds to the
> corresponding events.
>
> You can thus take over the task of filling a batch of records. I'd have to
> poke around the code to refresh my memory. Or, you can take a look at the
> (quite complex) JSON parser, or the EVF itself to see what it does. There
> are many unit tests that show this at various levels of abstraction.
>
> Basically, you have to:
>
> * Start a batch
> * Ask if you can start the next record (which might be declined if the
> batch is full)
> * Write each field. For complex fields, such as records, recursively do the
> start/end record work.
> * Mark the record as complete.
>
> You should be able to map event handlers to EVF actions as a result. Even
> though DFDL wants to "drive", it still has to give up control once the
> batch is full. EVF will then handle the (surprisingly complex) task of
> finishing up the batch and returning 

Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-12 Thread Mike Beckerle
So when a data format is described by a DFDL schema, I can generate an
equivalent Drill schema (TupleMetadata). This schema is always complete. I
have unit tests working with this.

To do this for a real SQL query, I need the DFDL schema to be identified on
the SQL query by a file path or URI.

Q: How do I get that DFDL schema File/URI parameter from the SQL query?

Next, assuming I have the DFDL schema identified, I generate an equivalent
Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)

What objects do I call, or what classes do I have to create to make this
Drill TupleMetadata available to Drill so it uses it in all the ways a
static Drill schema can be useful?

I just need pointers to the code that illustrate how to do this. Thanks

-Mike Beckerle










On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers  wrote:

> Mike,
>
> This is a complex question and has two answers.
>
> First, the standard enhanced vector framework (EVF) used by most readers
> assumes a "pull" model: read each record. This is where the next() comes
> in: readers just implement this to read the next record. But, the code
> under EVF works with a push model: the readers write to vectors, and signal
> the next record. EVF translates the lower-level push model to the
> higher-level, easier-to-use pull model. The best example of this is the
> JSON reader which uses Jackson to parse JSON and responds to the
> corresponding events.
>
> You can thus take over the task of filling a batch of records. I'd have to
> poke around the code to refresh my memory. Or, you can take a look at the
> (quite complex) JSON parser, or the EVF itself to see what it does. There
> are many unit tests that show this at various levels of abstraction.
>
> Basically, you have to:
>
> * Start a batch
> * Ask if you can start the next record (which might be declined if the
> batch is full)
> * Write each field. For complex fields, such as records, recursively do the
> start/end record work.
> * Mark the record as complete.
>
> You should be able to map event handlers to EVF actions as a result. Even
> though DFDL wants to "drive", it still has to give up control once the
> batch is full. EVF will then handle the (surprisingly complex) task of
> finishing up the batch and returning it as the output of the Scan operator.
>
> - Paul
>
> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle 
> wrote:
>
> > Daffodil parsing generates event callbacks to an InfosetOutputter, which
> is
> > analogous to a SAX event handler.
> >
> > Drill is expecting an iterator style of calling next() to advance through
> > the input, i.e., Drill has the control thread and expects to do pull
> > parsing. At least from the code I studied in the format-xml contrib.
> >
> > Is there any alternative? Before I dig into creating another one of these
> > co-routine-style control inversions (which have proven to be problematic
> > for performance.
> >
>


Drill expects pull parsing? Daffodil is event callbacks style

2023-10-11 Thread Mike Beckerle
Daffodil parsing generates event callbacks to an InfosetOutputter, which is
analogous to a SAX event handler.

Drill is expecting an iterator style of calling next() to advance through
the input, i.e., Drill has the control thread and expects to do pull
parsing. At least from the code I studied in the format-xml contrib.

Is there any alternative? I ask before I dig into creating another one of
these co-routine-style control inversions (which have proven to be
problematic for performance).
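For concreteness, here's one shape the bridge could take without a full
co-routine — a hedged sketch only; names are illustrative, not actual
plugin code. The idea is that Drill's pull-style next() drives one
Daffodil parse call per message, and the InfosetOutputter pushes the
resulting events into the column writers as they arrive:

import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ParseResult;
import org.apache.daffodil.japi.infoset.InfosetOutputter;
import org.apache.daffodil.japi.io.InputSourceDataInputStream;
import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
import org.apache.drill.exec.physical.resultSet.RowSetLoader;

public class PullBridgeSketch {
  private ResultSetLoader loader;           // from the EVF negotiator
  private DataProcessor dataProcessor;      // compiled DFDL schema
  private InputSourceDataInputStream input; // the data being parsed
  private InfosetOutputter outputter;       // pushes events into writers

  // Drill pulls; each call fills one batch by repeatedly letting
  // Daffodil push one message's worth of events into the writers.
  public boolean next() {
    RowSetLoader rowWriter = loader.writer();
    while (!rowWriter.isFull()) {
      if (!input.hasData()) {
        return false;  // input exhausted; this is the last batch
      }
      ParseResult result = dataProcessor.parse(input, outputter);
      if (result.isError()) {
        throw new RuntimeException("Daffodil parse failed: "
            + result.getDiagnostics());
      }
    }
    return true;  // batch full; Drill will call next() again
  }
}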


Question about Drill internal data representation for Daffodil tree infosets

2023-10-10 Thread Mike Beckerle
I am trying to understand the options for populating Drill data from a
Daffodil data parse.

Suppose you have this JSON

{"parent": { "sub1": { "a1":1, "a2":2}, sub2:{"b1":3, "b2":4, "b3":5}}}

or this equivalent XML:

<parent>
  <sub1><a1>1</a1><a2>2</a2></sub1>
  <sub2><b1>3</b1><b2>4</b2><b3>5</b3></sub2>
</parent>

Unlike those texts, Daffodil is going to have a tree data structure where a
parent node contains two child nodes sub1 and sub2, and each of those has
children a1, a2, and b1, b2, b3 respectively.
It's roughly analogous to the DOM tree of the XML, or the tree of nested
JSON map nodes you'd get back from a JSON parse of that text.

In Drill to query the JSON like:

select parent.sub1 from myStructure

gives you back a single column containing what seems to be a string like

|sub1|
--
| { "a1":1, "a2":2}  |

So, my question is this. Is this actually a string in Drill (what is the
type of sub1?), or is sub1 actually a Drill data row/map node value with two
child nodes that just happens to print out looking like a JSON string?

Thanks for any insight here.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-10-08 Thread Mike Beckerle
Or this single query:

with pcapDoc as (select PCAP from `infoset.json`),
 packets as (select flatten(pcapDoc.PCAP.Packet) as packet from
pcapDoc),
 ipv4Headers as (select
packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from
packets),
 ipsrcs as (select ipv4headers.hdr.IPSrc.value as ip from ipv4Headers),
 ipdests as (select ipv4headers.hdr.IPDest.value as ip from
ipv4Headers),
 ips as (select ip from ipsrcs union select ip from ipdests)
select * from ips;


On Sun, Oct 8, 2023 at 2:47 PM Mike Beckerle  wrote:

> Ok. It took some time but putting the infoset.json attachment into /tmp
> this SQL pulls out all the IP addresses from the PCAP data:
>
> use dfs.tmp;
>
> create or replace view pcapDoc as select PCAP from `infoset.json`;
>
> create or replace view packets as select flatten(pcapDoc.PCAP.Packet) as
> packet from pcapDoc;
>
> create or replace view ipv4Headers as select
> packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from
> packets ;
>
> create or replace view ipsrcs as select ipv4headers.hdr.IPSrc.value as ip
> from ipv4Headers;
>
> create or replace view ipdests as select ipv4headers.hdr.IPDest.value as
> ip from ipv4Headers;
>
> create or replace view ips as select ip from ipsrcs union select ip from
> ipdests;
>
> select * from ips;
>
> On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle 
> wrote:
>
>> ... sound of crickets on a summer night .
>>
>> It would really help me if I could get a response to this inquiry, to
>> help me better understand Drill.
>>
>> I do realize people are busy, and also this was originally sent Aug 25,
>> which is the wrong time of year to get timely response to anything.
>> Hence, this re-up of the message.
>>
>>
>> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
>> wrote:
>>
>>> Below is a small JSON output from Daffodil and below that is the same
>>> Infoset output as XML.
>>> (They're inline in this message, but I also attached them as files)
>>>
>>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>>> It's an example DFDL schema used to illustrate binary file parsing.
>>>
>>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>>
>>> My theory is that Drill queries against these should be identical to
>>> obtain the same output row contents.
>>> That is, since this data has the same schema, whether it is JSON or XML
>>> shouldn't affect how you query it.
>>> To do that the XML Reader will need the XML schema (or some
>>> hand-provided metadata) so it knows what is an array. (Specifically
>>> PCAP.Packet is the array.)
>>>
>>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>>> all ICMP packets in this file, that query should be the same for the JSON
>>> and the XML data.
>>>
>>> First question: Does that make sense? I want to make sure I'm
>>> understanding this right.
>>>
>>> Second question, since I don't really understand Drill SQL yet.
>>>
>>> What is a query that would pluck the IPSrc.value and IPDest.value from
>>> this data and make a row of each pair of those?
>>>
>>> The top level is a map with a single element named PCAP.
>>> The "table" is PCAP.Packet which is an array (of maps).
>>> And within each array item's map the fields of interest are within
>>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>>> (so maybe IPv4Header is the table?)
>>> The two fields within there are IPSrc.value (AS src) and IPDest.value
>>> (AS dest)
>>>
>>> I'm lost on how to tell the query that the table is the array
>>> PCAP.Packet, or the IPv4Header within those maybe?
>>>
>>> Maybe this is easy, but I'm just not grokking it yet so I could use some
>>> help here.
>>>
>>> Thanks in advance.
>>>
>>> {
>>> "PCAP": {
>>> "PCAPHeader": {
>>> "MagicNumber": "D4C3B2A1",
>>> "Version": {
>>> "Major": "2",
>>> "Minor": "4"
>>> },
>>> "Zone": "0",
>>> "SigFigs": "0",
>>> "SnapLen": "65535",
>>> "Network": "1"
>>> },
>>> "Packet": [
>>> {
>>> "PacketHeader&

Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-10-08 Thread Mike Beckerle
Ok. It took some time but putting the infoset.json attachment into /tmp
this SQL pulls out all the IP addresses from the PCAP data:

use dfs.tmp;

create or replace view pcapDoc as select PCAP from `infoset.json`;

create or replace view packets as select flatten(pcapDoc.PCAP.Packet) as
packet from pcapDoc;

create or replace view ipv4Headers as select
packets.packet.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header as hdr from
packets ;

create or replace view ipsrcs as select ipv4headers.hdr.IPSrc.value as ip
from ipv4Headers;

create or replace view ipdests as select ipv4headers.hdr.IPDest.value as ip
from ipv4Headers;

create or replace view ips as select ip from ipsrcs union select ip from
ipdests;

select * from ips;

On Wed, Sep 13, 2023 at 9:43 AM Mike Beckerle  wrote:

> ... sound of crickets on a summer night .
>
> It would really help me if I could get a response to this inquiry, to help
> me better understand Drill.
>
> I do realize people are busy, and also this was originally sent Aug 25,
> which is the wrong time of year to get timely response to anything.
> Hence, this re-up of the message.
>
>
> On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle 
> wrote:
>
>> Below is a small JSON output from Daffodil and below that is the same
>> Infoset output as XML.
>> (They're inline in this message, but I also attached them as files)
>>
>> This is just a parse of a small PCAP file with a few ICMP packets in it.
>> It's an example DFDL schema used to illustrate binary file parsing.
>>
>> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
>> component schema: https://github.com/DFDLSchemas/ethernetIP)
>>
>> My theory is that Drill queries against these should be identical to
>> obtain the same output row contents.
>> That is, since this data has the same schema, whether it is JSON or XML
>> shouldn't affect how you query it.
>> To do that the XML Reader will need the XML schema (or some hand-provided
>> metadata) so it knows what is an array. (Specifically PCAP.Packet is the
>> array.)
>>
>> E.g., if you wanted to get the IPSrc and IPDest fields in a table from
>> all ICMP packets in this file, that query should be the same for the JSON
>> and the XML data.
>>
>> First question: Does that make sense? I want to make sure I'm
>> understanding this right.
>>
>> Second question, since I don't really understand Drill SQL yet.
>>
>> What is a query that would pluck the IPSrc.value and IPDest.value from
>> this data and make a row of each pair of those?
>>
>> The top level is a map with a single element named PCAP.
>> The "table" is PCAP.Packet which is an array (of maps).
>> And within each array item's map the fields of interest are within
>> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
>> (so maybe IPv4Header is the table?)
>> The two fields within there are IPSrc.value (AS src) and IPDest.value (AS
>> dest)
>>
>> I'm lost on how to tell the query that the table is the array
>> PCAP.Packet, or the IPv4Header within those maybe?
>>
>> Maybe this is easy, but I'm just not grokking it yet so I could use some
>> help here.
>>
>> Thanks in advance.
>>
>> {
>> "PCAP": {
>> "PCAPHeader": {
>> "MagicNumber": "D4C3B2A1",
>> "Version": {
>> "Major": "2",
>> "Minor": "4"
>> },
>> "Zone": "0",
>> "SigFigs": "0",
>> "SnapLen": "65535",
>> "Network": "1"
>> },
>> "Packet": [
>> {
>> "PacketHeader": {
>> "Seconds": "1371631556",
>> "USeconds": "838904",
>> "InclLen": "74",
>> "OrigLen": "74"
>> },
>> "LinkLayer": {
>> "Ethernet": {
>> "MACDest": "005056E01449",
>> "MACSrc": "000C29340BDE",
>> "Ethertype": "2048",
>> "NetworkLayer": {
>> "IPv4": {
>> "IPv4Header": {
>> "Version": "4",
>> "IHL": "5",
>> "DSCP": "0",
>> "ECN": "0",
>> "Length": "60",
>> "Identification": "55107",
>> "Flags": "0",
>> "FragmentOffset": "0",
>> "TTL": "128",
>> "Protocol": "1",
>> "Ch

Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-10-08 Thread Mike Beckerle
Never mind. I figured this out. It was due to 'properties' being a reserved
keyword. I created a PR to fix the JSON doc on the Drill site.
On Sat, Oct 7, 2023 at 1:46 PM Mike Beckerle  wrote:

> Ok, after weeks of delay
>
> That helps a great deal. You flatten the array of maps into a table of
> maps.
>
> I am confused still about when I must do square brackets versus dot
> notation: data['a'] vs. data.a
> The JSON documentation for Drill uses dot notation to reach into fields of
> a map.
>
> Ex: from the JSON doc:
>
> {
>   "type": "FeatureCollection",
>   "features": [
>   {
> "type": "Feature",
> "properties":
> {
>   "MAPBLKLOT": "0001001",
>   "BLKLOT": "0001001",
>   "BLOCK_NUM": "0001",
>   "LOT_NUM": "001",
>
>
> The query uses SELECT features[0].properties.MAPBLKLOT, FROM ...
> Which is using dot notation where in your queries on my JSON you did not
> use dot notation.
>
> I tried revising the queries you wrote using the dot notation, and it was
> rejected. "no table named 'data'", but I'm not sure why.
>
> Ex:
>
> This works: (your original working query)
>
> SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;
>
> But this fails:
>
> SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
> dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
> Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
> Table 'data' not found
>
> But your sub-select defines 'data' as, I would assume, a table.
>
> Can you help me clarify this?
>
> [Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
> apache drill>
>
>
> On Mon, Sep 18, 2023 at 11:17 PM Charles Givre  wrote:
>
>> Hi Mike,
>> Let me answer your question with some queries:
>>
>>  >>> select * from dfs.test.`record.json`;
>>
>> +--+
>> |  record
>>  |
>>
>> +--+
>> |
>> [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}]
>> |
>>
>> +--+
>>
>> Now... I can flatten that like this:
>>
>> >>> select flatten(record) AS data from dfs.test.`record.json`;
>> +--+
>> | data |
>> +--+
>> | {"a":{"a1":5.0,"a2":6.0},"b":{}} |
>> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
>> | {"a":{"a1":7.0,"a2":8.0},"b":{}} |
>> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
>> +--+
>> 4 rows selected (0.298 seconds)
>>
>> You asked about filtering.   For this, I broke it up into a subquery, but
>> here's how I did that:
>>
>> >>> SELECT data['a'], data['b']
>> 2..semicolon> FROM (select flatten(record) AS data from
>> dfs.test.`record.json`)
>> 3..semicolon> WHERE data['b']['b1'] > 60.0;
>> ++-+
>> | EXPR$0 | EXPR$1  |
>> ++-+
>> | {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
>> ++-+
>> 1 row selected (0.379 seconds)
>>
>> I did all this without the union data type.
>>
>> Does this make sense?
>> Best,
>> -- C
>>
>>
>> On Sep 13, 2023, at 11:08 AM, Mike Beckerle  wrote:
>>
>> I'm thinking whether a first prototype of DFDL integration to Drill should
>> just use JSON.
>>
>> But please consider this JSON:
>>
>> { "record": [
>>{ "a": { "a1":5, "a2":6 } },
&

Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-10-07 Thread Mike Beckerle
Ok, after weeks of delay...

That helps a great deal. You flatten the array of maps into a table of maps.

I am still confused about when I must use square brackets versus dot
notation: data['a'] vs. data.a.
The JSON documentation for Drill uses dot notation to reach into fields of
a map.

Ex: from the JSON doc:

{
  "type": "FeatureCollection",
  "features": [
  {
"type": "Feature",
"properties":
{
  "MAPBLKLOT": "0001001",
  "BLKLOT": "0001001",
  "BLOCK_NUM": "0001",
  "LOT_NUM": "001",
   

The query uses SELECT features[0].properties.MAPBLKLOT FROM ...,
which is using dot notation, whereas in your queries on my JSON you did not
use dot notation.

I tried revising the queries you wrote using the dot notation, and it was
rejected. "no table named 'data'", but I'm not sure why.

Ex:

This works: (your original working query)

SELECT data['a'], data['b'] FROM (select flatten(record) AS data from
dfs.`/tmp/record.json`) WHERE data['b']['b1'] > 60.0;

But this fails:

SELECT data.a AS a, data.b AS b FROM (select flatten(record) AS data from
dfs.`/tmp/record.json`) WHERE data.b.b1 > 60.0;
Error: VALIDATION ERROR: From line 1, column 105 to line 1, column 108:
Table 'data' not found

But your sub-select defines 'data' as, I would assume, a table.

Can you help me clarify this?

[Error Id: 90c03b40-4f00-43b5-9de9-598102797b2f ] (state=,code=0)
apache drill>
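One hedged guess at what's going on, recorded here in case it helps: Drill
may be parsing the unqualified identifier data.b.b1 as table 'data' and
column 'b', so the dot notation may need the subquery to have an alias,
with that alias used as a prefix. Something like (untested; the alias 't'
is hypothetical):

SELECT t.data.a AS a, t.data.b AS b
FROM (SELECT flatten(record) AS data FROM dfs.`/tmp/record.json`) AS t
WHERE t.data.b.b1 > 60.0;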


On Mon, Sep 18, 2023 at 11:17 PM Charles Givre  wrote:

> Hi Mike,
> Let me answer your question with some queries:
>
>  >>> select * from dfs.test.`record.json`;
>
> +--+
> |  record
>  |
>
> +--+
> |
> [{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}]
> |
>
> +--+
>
> Now... I can flatten that like this:
>
> >>> select flatten(record) AS data from dfs.test.`record.json`;
> +--+
> | data |
> +--+
> | {"a":{"a1":5.0,"a2":6.0},"b":{}} |
> | {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
> | {"a":{"a1":7.0,"a2":8.0},"b":{}} |
> | {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
> +--+
> 4 rows selected (0.298 seconds)
>
> You asked about filtering.   For this, I broke it up into a subquery, but
> here's how I did that:
>
> >>> SELECT data['a'], data['b']
> 2..semicolon> FROM (select flatten(record) AS data from
> dfs.test.`record.json`)
> 3..semicolon> WHERE data['b']['b1'] > 60.0;
> ++-+
> | EXPR$0 | EXPR$1  |
> ++-+
> | {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
> ++-+
> 1 row selected (0.379 seconds)
>
> I did all this without the union data type.
>
> Does this make sense?
> Best,
> -- C
>
>
> On Sep 13, 2023, at 11:08 AM, Mike Beckerle  wrote:
>
> I'm thinking whether a first prototype of DFDL integration to Drill should
> just use JSON.
>
> But please consider this JSON:
>
> { "record": [
>{ "a": { "a1":5, "a2":6 } },
>{ "b": { "b1":55, "b2":66, "b3":77 } }
>{ "a": { "a1":7, "a2":8 } },
>{ "b": { "b1":77, "b2":88, "b3":99 } }
>  ] }
>
> It corresponds to this text data file, parsed using Daffodil:
>
>105062556677107082778899
>
> The file is a stream of records. The first byte is a tag value 1 for type
> 'a' records, and 2 for type 'b' records.
> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
> a2. They are integers.
> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
> and b3. They are in

Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-09-13 Thread Mike Beckerle
I'm thinking whether a first prototype of DFDL integration to Drill should
just use JSON.

But please consider this JSON:

{ "record": [
{ "a": { "a1":5, "a2":6 } },
{ "b": { "b1":55, "b2":66, "b3":77 } }
{ "a": { "a1":7, "a2":8 } },
{ "b": { "b1":77, "b2":88, "b3":99 } }
  ] }

It corresponds to this text data file, parsed using Daffodil:

105062556677107082778899

The file is a stream of records. The first byte is a tag: value 1 for type
'a' records, and 2 for type 'b' records.
The 'a' records are 2 fixed-length fields, each 2 bytes long, named a1 and
a2. They are integers.
The 'b' records are 3 fixed-length fields, each 2 bytes long, named b1, b2,
and b3. They are integers.
This kind of format is very common, even textualized like this (from COBOL
programs, for example).

Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
(and ... does this require the experimental Union feature?)

b1, b2
-
(55, 66)
(77, 88)

I ask because, in an XML schema or DFDL schema, choices with dozens of
'branches' are very common.
Ex: schema for the above data:

[The XML schema was stripped by the mail archive. In outline, it is an
xs:choice whose first branch is an element with many child elements, let's
say named a1, a2, ..., and whose second branch is an element with many
child elements, let's say named b1, b2, b3, ...]

To me, XSD choice naturally requires a union feature of some sort.
If that's still experimental in Drill ... what to do?

On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre 
wrote:

> @mbeckerle 
> You've encountered another challenge that exists in Drill reading data
> without a schema.
> Let me explain a bit about this and I'm going to use the JSON reader as an
> example. First Drill requires data to be homogeneous. Drill does have a
> Union vector type which allows heterogeneous data however this is a bit
> experimental and I wouldn't recommend using it. Also, it really just shifts
> schema inconsistencies to the user.
>
> For instance, let's say you have a column consisting of strings and
> floats. What happens if you try to do something like this:
>
> SELECT sum(mixed_col)
> -- or
> SELECT ORDER BY mixed_col
>
> Remembering that Drill is distributed: if you have a column with the
> same name and you try to do these operations, they will fail.
> Let's say we have data like this:
>
> [
>   {
>  'col1': 'Hi there',
>  'col2': 5.0
>   },
>   {
>  'col1':True,
>  'col2': 4,
>  'col3': 'foo'
>   }
> ]
>
> In older versions of Drill, this kind of data, this would throw all kinds
> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
>  submitted apache#2638
>  which overhauled implicit
> casting. What this meant for users is that col2 in the above would be
> automatically cast to a FLOAT and col1 would be automatically cast to a
> VARCHAR.
>
> However, when reading data the story is a little different. What we did
> for the JSON reader was have several read modes. The least tolerant
> attempts to infer all data types. This seems like a great idea in practice,
> however when you start actually using Drill with real data, you start
> seeing the issues with this approach. The JSON reader has a few
> configuration options that increase its tolerance for bad data. The next
> level is readAllNumbersAsDouble which... as the name implies, reads all
> numeric data as Doubles and does not attempt to infer ints vs floats. The
> next options is allTextMode which reads all fields as VARCHAR. This
> should be used when the data is so inconsistent that it cannot be read with
> either mode. These modes can be set globally, at the plugin level or at
> query time.
>
> For the XML reader, I didn't add type inference because I figured the data
> would be quite messy, however it wouldn't be that hard to add basically the
> same levels as the JSON reader.
>
> This fundamental issue exists in all the readers that read data without a
> schema. My rationale for working on the XSD reader is that this will enable
> us to accurately read XML data with all the correct data types.
>


Re: Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-09-13 Thread Mike Beckerle
... sound of crickets on a summer night .

It would really help me if I could get a response to this inquiry, to help
me better understand Drill.

I do realize people are busy, and also that this was originally sent Aug 25,
which is the wrong time of year to get a timely response to anything.
Hence this re-up of the message.


On Fri, Aug 25, 2023 at 7:39 PM Mike Beckerle  wrote:

> Below is a small JSON output from Daffodil and below that is the same
> Infoset output as XML.
> (They're inline in this message, but I also attached them as files)
>
> This is just a parse of a small PCAP file with a few ICMP packets in it.
> It's an example DFDL schema used to illustrate binary file parsing.
>
> (The schema is here https://github.com/DFDLSchemas/PCAP which uses this
> component schema: https://github.com/DFDLSchemas/ethernetIP)
>
> My theory is that Drill queries against these should be identical to
> obtain the same output row contents.
> That is, since this data has the same schema, whether it is JSON or XML
> shouldn't affect how you query it.
> To do that the XML Reader will need the XML schema (or some hand-provided
> metadata) so it knows what is an array. (Specifically PCAP.Packet is the
> array.)
>
> E.g., if you wanted to get the IPSrc and IPDest fields in a table from all
> ICMP packets in this file, that query should be the same for the JSON and
> the XML data.
>
> First question: Does that make sense? I want to make sure I'm
> understanding this right.
>
> Second question, since I don't really understand Drill SQL yet.
>
> What is a query that would pluck the IPSrc.value and IPDest.value from
> this data and make a row of each pair of those?
>
> The top level is a map with a single element named PCAP.
> The "table" is PCAP.Packet which is an array (of maps).
> And within each array item's map the fields of interest are within
> LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header
> (so maybe IPv4Header is the table?)
> The two fields within there are IPSrc.value (AS src) and IPDest.value (AS
> dest)
>
> I'm lost on how to tell the query that the table is the array PCAP.Packet,
> or the IPv4Header within those maybe?
>
> Maybe this is easy, but I'm just not grokking it yet so I could use some
> help here.
>
> Thanks in advance.
>
> {
> "PCAP": {
> "PCAPHeader": {
> "MagicNumber": "D4C3B2A1",
> "Version": {
> "Major": "2",
> "Minor": "4"
> },
> "Zone": "0",
> "SigFigs": "0",
> "SnapLen": "65535",
> "Network": "1"
> },
> "Packet": [
> {
> "PacketHeader": {
> "Seconds": "1371631556",
> "USeconds": "838904",
> "InclLen": "74",
> "OrigLen": "74"
> },
> "LinkLayer": {
> "Ethernet": {
> "MACDest": "005056E01449",
> "MACSrc": "000C29340BDE",
> "Ethertype": "2048",
> "NetworkLayer": {
> "IPv4": {
> "IPv4Header": {
> "Version": "4",
> "IHL": "5",
> "DSCP": "0",
> "ECN": "0",
> "Length": "60",
> "Identification": "55107",
> "Flags": "0",
> "FragmentOffset": "0",
> "TTL": "128",
> "Protocol": "1",
> "Checksum": "11123",
> "IPSrc": {
> "value": "192.168.158.139"
> },
> "IPDest": {
> "value": "174.137.42.77"
> },
> "ComputedChecksum": "11123"
> },
> "Protocol": "1",
> "ICMPv4": {
> "Type": "8",
> "Code": "0",
> "Checksum": "10844",
> "EchoRequest": {
> "Identifier": "512",
> "SequenceNumber": "8448",
> "Payload":
> "6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
> }
> }
> }
> }
> }
> }
> },
> {
> "PacketHeader": {
> "Seconds": "1371631557",
> "USeconds": "55699",
> "InclLen": "74",
> "OrigLen": "74"
> },
> "LinkLayer": {
> "Ethernet": {
> "MACDest": "000C29340BDE",
> "MACSrc": "005056E01449",
> "Ethertype": "2048",
&

Discuss: JSON and XML and Daffodil - same Infoset, same Query should create same rowset?

2023-08-25 Thread Mike Beckerle
"Identification": "30448",
"Flags": "0",
"FragmentOffset": "0",
"TTL": "128",
"Protocol": "1",
"Checksum": "35782",
"IPSrc": {
"value": "174.137.42.77"
},
"IPDest": {
"value": "192.168.158.139"
},
"ComputedChecksum": "35782"
},
"Protocol": "1",
"ICMPv4": {
"Type": "0",
"Code": "0",
"Checksum": "12380",
"EchoReply": {
"Identifier": "512",
"SequenceNumber": "8960",
"Payload":
"6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
}
}
}
}
}
}
},
{
"PacketHeader": {
"Seconds": "1371631559",
"USeconds": "841775",
"InclLen": "74",
"OrigLen": "74"
},
"LinkLayer": {
"Ethernet": {
"MACDest": "005056E01449",
"MACSrc": "000C29340BDE",
"Ethertype": "2048",
"NetworkLayer": {
"IPv4": {
"IPv4Header": {
"Version": "4",
"IHL": "5",
"DSCP": "0",
"ECN": "0",
"Length": "60",
"Identification": "55118",
"Flags": "0",
"FragmentOffset": "0",
"TTL": "128",
"Protocol": "1",
"Checksum": "2",
"IPSrc": {
"value": "192.168.158.139"
},
"IPDest": {
"value": "174.137.42.77"
},
"ComputedChecksum": "2"
},
"Protocol": "1",
"ICMPv4": {
"Type": "8",
"Code": "0",
"Checksum": "10076",
"EchoRequest": {
"Identifier": "512",
"SequenceNumber": "9216",
"Payload":
"6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
}
}
}
}
}
}
},
{
"PacketHeader": {
"Seconds": "1371631560",
"USeconds": "42354",
"InclLen": "74",
"OrigLen": "74"
},
"LinkLayer": {
"Ethernet": {
"MACDest": "000C29340BDE",
"MACSrc": "005056E01449",
"Ethertype": "2048",
"NetworkLayer": {
"IPv4": {
"IPv4Header": {
"Version": "4",
"IHL": "5",
"DSCP": "0",
"ECN": "0",
"Length": "60",
"Identification": "30453",
"Flags": "0",
"FragmentOffset": "0",
"TTL": "128",
"Protocol": "1",
"Checksum": "35777",
"IPSrc": {
"value": "174.137.42.77"
},
"IPDest": {
"value": "192.168.158.139"
},
"ComputedChecksum": "35777"
},
"Protocol": "1",
"ICMPv4": {
"Type": "0",
"Code": "0",
"Checksum": "12124",
"EchoReply": {
"Identifier": "512",
"SequenceNumber": "9216",
"Payload":
"6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869"
}
}
}
}
}
}
}
]
}
}
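
For reference, my best stab so far at such a query, using FLATTEN on the
Packet array plus nested map access (an untested sketch; file name
hypothetical):

  SELECT p.pkt.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPSrc.`value` AS src,
         p.pkt.LinkLayer.Ethernet.NetworkLayer.IPv4.IPv4Header.IPDest.`value` AS dest
  FROM (
    SELECT FLATTEN(t.PCAP.Packet) AS pkt
    FROM dfs.`pcap-output.json` t
  ) p

Here is the same infoset as XML: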




<PCAP>
  <PCAPHeader>
    <MagicNumber>D4C3B2A1</MagicNumber>
    <Version>
      <Major>2</Major>
      <Minor>4</Minor>
    </Version>
    <Zone>0</Zone>
    <SigFigs>0</SigFigs>
    <SnapLen>65535</SnapLen>
    <Network>1</Network>
  </PCAPHeader>
  <Packet>
    <PacketHeader>
      <Seconds>1371631556</Seconds>
      <USeconds>838904</USeconds>
      <InclLen>74</InclLen>
      <OrigLen>74</OrigLen>
    </PacketHeader>
    <LinkLayer>
      <Ethernet>
        <MACDest>005056E01449</MACDest>
        <MACSrc>000C29340BDE</MACSrc>
        <Ethertype>2048</Ethertype>
        <NetworkLayer>
          <IPv4>
            <IPv4Header>
              <Version>4</Version>
              <IHL>5</IHL>
              <DSCP>0</DSCP>
              <ECN>0</ECN>
              <Length>60</Length>
              <Identification>55107</Identification>
              <Flags>0</Flags>
              <FragmentOffset>0</FragmentOffset>
              <TTL>128</TTL>
              <Protocol>1</Protocol>
              <Checksum>11123</Checksum>
              <IPSrc>
                <value>192.168.158.139</value>
              </IPSrc>
              <IPDest>
                <value>174.137.42.77</value>
              </IPDest>
              <ComputedChecksum>11123</ComputedChecksum>
            </IPv4Header>
            <Protocol>1</Protocol>
            <ICMPv4>
              <Type>8</Type>
              <Code>0</Code>
              <Checksum>10844</Checksum>
              <EchoRequest>
                <Identifier>512</Identifier>
                <SequenceNumber>8448</SequenceNumber>
                <Payload>6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869</Payload>
              </EchoRequest>
            </ICMPv4>
          </IPv4>
        </NetworkLayer>
      </Ethernet>
    </LinkLayer>
  </Packet>
  <Packet>
    <PacketHeader>
      <Seconds>1371631557</Seconds>
      <USeconds>55699</USeconds>
      <InclLen>74</InclLen>
      <OrigLen>74</OrigLen>
    </PacketHeader>
    <LinkLayer>
      <Ethernet>
        <MACDest>000C29340BDE</MACDest>
        <MACSrc>005056E01449</MACSrc>
        <Ethertype>2048</Ethertype>
        <NetworkLayer>
          <IPv4>
            <IPv4Header>
              <Version>4</Version>
              <IHL>5</IHL>
              <DSCP>0</DSCP>
              <ECN>0</ECN>
              <Length>60</Length>
              <Identification>30433</Identification>
              <Flags>0</Flags>
              <FragmentOffset>0</FragmentOffset>
              <TTL>128</TTL>
              <Protocol>1</Protocol>
              <Checksum>35797</Checksum>
              <IPSrc>
                <value>174.137.42.77</value>
              </IPSrc>
              <IPDest>
                <value>192.168.158.139</value>
              </IPDest>
              <ComputedChecksum>35797</ComputedChecksum>
            </IPv4Header>
            <Protocol>1</Protocol>
            <ICMPv4>
              <Type>0</Type>
              <Code>0</Code>
              <Checksum>12892</Checksum>
              <EchoReply>
                <Identifier>512</Identifier>
                <SequenceNumber>8448</SequenceNumber>
                <Payload>6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869</Payload>
              </EchoReply>
            </ICMPv4>
          </IPv4>
        </NetworkLayer>
      </Ethernet>
    </LinkLayer>
  </Packet>
  <Packet>
    <PacketHeader>
      <Seconds>1371631557</Seconds>
      <USeconds>840049</USeconds>
      <InclLen>74</InclLen>
      <OrigLen>74</OrigLen>
    </PacketHeader>
    <LinkLayer>
      <Ethernet>
        <MACDest>005056E01449</MACDest>
        <MACSrc>000C29340BDE</MACSrc>
        <Ethertype>2048</Ethertype>
        <NetworkLayer>
          <IPv4>
            <IPv4Header>
              <Version>4</Version>
              <IHL>5</IHL>
              <DSCP>0</DSCP>
              <ECN>0</ECN>
              <Length>60</Length>
              <Identification>55110</Identification>
              <Flags>0</Flags>
              <FragmentOffset>0</FragmentOffset>
              <TTL>128</TTL>
              <Protocol>1</Protocol>
              <Checksum>11120</Checksum>
              <IPSrc>
                <value>192.168.158.139</value>
              </IPSrc>
              <IPDest>
                <value>174.137.42.77</value>
              </IPDest>
              <ComputedChecksum>11120</ComputedChecksum>
            </IPv4Header>
            <Protocol>1</Protocol>
            <ICMPv4>
              <Type>8</Type>
              <Code>0</Code>
              <Checksum>10588</Checksum>
              <EchoRequest>
                <Identifier>512</Identifier>
                <SequenceNumber>8704</SequenceNumber>
                <Payload>6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869</Payload>
              </EchoRequest>
            </ICMPv4>
          </IPv4>
        </NetworkLayer>
      </Ethernet>
    </LinkLayer>
  </Packet>
  <Packet>
    <PacketHeader>
      <Seconds>1371631558</Seconds>
      <USeconds>44196</USeconds>
      <InclLen>74</InclLen>
      <OrigLen>74</OrigLen>
    </PacketHeader>
    <LinkLayer>
      <Ethernet>
        <MACDest>000C29340BDE</MACDest>
        <MACSrc>005056E01449</MACSrc>
        <Ethertype>2048</Ethertype>
        <NetworkLayer>
          <IPv4>
            <IPv4Header>
              <Version>4</Version>
              <IHL>5</IHL>
              <DSCP>0</DSCP>
              <ECN>0</ECN>
              <Length>60</Length>
              <Identification>30436</Identification>
              <Flags>0</Flags>
              <FragmentOffset>0</FragmentOffset>
              <TTL>128</TTL>
              <Protocol>1</Protocol>
              <Checksum>35794</Checksum>
              <IPSrc>
                <value>174.137.42.77</value>
              </IPSrc>
              <IPDest>
                <value>192.168.158.139</value>
              </IPDest>
              <ComputedChecksum>35794</ComputedChecksum>
            </IPv4Header>
            <Protocol>1</Protocol>
            <ICMPv4>
              <Type>0</Type>
              <Code>0</Code>
              <Checksum>12636</Checksum>
              <EchoReply>
                <Identifier>512</Identifier>
                <SequenceNumber>8704</SequenceNumber>
                <Payload>6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869</Payload>
              </EchoReply>
            </ICMPv4>
          </IPv4>
        </NetworkLayer>
      </Ethernet>
    </LinkLayer>
  </Packet>
  <Packet>
    <PacketHeader>
      <Seconds>1371631558</Seconds>
      <USeconds>841168</USeconds>
      <InclLen>74</InclLen>
      <OrigLen>74</OrigLen>
    </PacketHeader>
    <LinkLayer>
      <Ethernet>
        <MACDest>005056E01449</MACDest>
        <MACSrc>000C29340BDE</MACSrc>
        <Ethertype>2048</Ethertype>
        <NetworkLayer>
          <IPv4>
            <IPv4Header>
              <Version>4</Version>
              <IHL>5</IHL>
              <DSCP>0</DSCP>
              <ECN>0</ECN>
              <Length>60</Length>
              <Identification>55113</Identification>
              <Flags>0</Flags>
              <FragmentOffset>0</FragmentOffset>
              <TTL>128</TTL>
              <Protocol>1</Protocol>
              <Checksum>7</Checksum>
              <IPSrc>
                <value>192.168.158.139</value>
              </IPSrc>
              <IPDest>
                <value>174.137.42.77</value>
              </IPDest>
              <ComputedChecksum>7</ComputedChecksum>
            </IPv4Header>
            <Protocol>1</Protocol>
            <ICMPv4>
              <Type>8</Type>
              <Code>0</Code>
              <Checksum>10332</Checksum>
              <EchoRequest>
                <Identifier>512</Identifier>
                <SequenceNumber>8960</SequenceNumber>
                <Payload>6162636465666768696A6B6C6D6E6F7071727374757677616263646566676869</Payload>
              </EchoRequest>
            </ICMPv4>
          </IPv4>
        </NetworkLayer>
      </Ethernet>
    </LinkLayer>
  </Packet>
  <Packet>
    <PacketHeader>
      <Seconds>1371631559</Seconds>
      <USeconds>85428</USeconds>
      <InclLen>74</InclLen>
      <OrigLen>74</OrigLen>
    </PacketHeader>
    <LinkLayer>
      <Ethernet>
        <MACDest>000C29340BDE</MACDest>
        <MACSrc>005056E01449</MACSrc>
        <Ethertype>2048</Ethertype>
        <NetworkLayer>
          <IPv4>
            <IPv4Header>
              <Version>4</Version>
              <IHL>5</IHL>
              <DSCP>0</DSCP>
              <ECN>0</ECN>
              <Length>60</Length>
              <Identification>30448</Identification>
              <Flags>0</Flags>
              <FragmentOffset>0</FragmentOffset>
              <TTL>128</TTL>
              <Protocol>1</Protocol>
              <Checksum>35782</Checksum>
              <IPSrc>
                <value>174.137.

Re: drill tests not passing

2023-08-25 Thread Mike Beckerle
Thank you for your help.

To review, this command works and succeeds:

mvn clean install -DskipTests=true

mvn test ... with args per your prior email...

fails with the "drill-java-exec: Artifact has not been packaged yet..."
error.

On Fri, Aug 25, 2023 at 3:54 AM James Turton  wrote:

> Okay, that attempt couldn't even build java-exec. What does the
> following do when run from the root of the source tree?"
>
> mvn package -f exec/vector/pom.xml
>

The above builds successfully.

The one below builds, but executing its tests produces a huge number of
errors. File attached.


> mvn package -f exec/java-exec/pom.xml
>
> On 2023/08/24 17:09, Mike Beckerle wrote:
> >
> > Still no luck.
> >
> > Output from your mvn-test-drill command is attached.
> >
> > drill-java-exec: Artifact has not been packaged yet. 
> >
> >
> > On Mon, Aug 21, 2023 at 10:36 AM James Turton  > <mailto:dz...@apache.org>> wrote:
> >
> > __
> > Hi Mike
> >
> > I took a look at the build log that you shared with me recently and
> > what happened in that run was OOM (first report on line 9348).
> >
> > |   9355 2198813 ERROR [UserServer-1]
> > [org.apache.drill.exec.rpc.RpcExceptionHandler] - Exception in RPC
> > communication.9355  Connection: /127.0.0.1:31046
> > <http://127.0.0.1:31046> <--> /127.0.0.1:39558
> > <http://127.0.0.1:39558> (user server).  Closing connection.
> > 9356 java.lang.OutOfMemoryError: GC overhead limit exceeded
> > |
> > My command to run Drill tests set memory limits as follows.
> >
> > |alias mvn-test-drill="mvn test \
> >  -Djunit.args=\"-Duser.timezone=UTC -Duser.language=en
> > -Duser.region=US\" \
> >  -DmemoryMb=2560 -DdirectMemoryMb=2560 \
> >  -DforkCount=2"
> > |
> > Hope this helps...
> > James
> >
> > On 2023/07/25 00:00, Mike Beckerle wrote:
> >> Hi drill devs,
> >>
> >> I'm still stuck on this problem. Can anyone suggest a way past this?
> >>
> >> Mike Beckerle
> >> Apache Daffodil PMC |daffodil.apache.org  <
> http://daffodil.apache.org>
> >> OGF DFDL Workgroup Co-Chair |
> www.ogf.org/ogf/doku.php/standards/dfdl/dfdl  <
> http://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl>
> >> Owl Cyber Defense |www.owlcyberdefense.com  <
> http://www.owlcyberdefense.com>
> >>
> >>
> >>
> >> On Mon, Jul 17, 2023 at 9:53 AM Mike Beckerle
> <mailto:mbecke...@apache.org>  wrote:
> >>
> >>> Looks like I attached the wrong file. Doing too many things at
> once.
> >>>
> >>> The correct file is attached here.
> >>>
> >>>
> >>>
> >>> On Fri, Jul 14, 2023 at 2:04 PM Mike Beckerle
> <mailto:mbecke...@apache.org>
> >>> wrote:
> >>>
> >>>> Update: I did a clean and install -DskipTests=true.
> >>>>
> >>>> Then I tried the mvn test using the non-UTC timezone stuff, as
> suggested.
> >>>>
> >>>> But alas, it still fails, this time the failure unique and is
> only in
> >>>> "Java Execution Engine"
> >>>>
> >>>> [ERROR] Failed to execute goal
> >>>> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
> >>>> (unpack-vector-types) on project drill-java-exec: Artifact has
> not been
> >>>> packaged yet. When used on reactor artifact, unpack should be
> executed
> >>>> after packaging: see MDEP-98. -> [Help 1]
> >>>>
> >>>> The command and complete trace output are below.
> >>>>
> >>>> I need assistance on how to proceed.
> >>>>
> >>>> Complete trace from the mvn test is attached.
> >>>>
> >>>>
> >>>> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle<
> mbecke...@apache.org>  <mailto:mbecke...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> To answer questions:
> >>>>>
> >>>>> 1. Paul: This is a 100% stock build. All I have done is clone
> the repo
> >>>>> (master branch). Make a new git branch (in case I make future
> changes). Try
> >>>>> to build (success) and test (failed 

Re: drill tests not passing

2023-08-24 Thread Mike Beckerle
Still no luck.

Output from your mvn-test-drill command is attached.

drill-java-exec: Artifact has not been packaged yet. 


On Mon, Aug 21, 2023 at 10:36 AM James Turton  wrote:

> Hi Mike
>
> I took a look at the build log that you shared with me recently and what
> happened in that run was OOM (first report on line 9348).
>
>9355 2198813 ERROR [UserServer-1]
> [org.apache.drill.exec.rpc.RpcExceptionHandler] - Exception in RPC
> communication.9355  Connection: /127.0.0.1:31046 <--> /127.0.0.1:39558
> (user server).  Closing connection.
>9356 java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> My command to run Drill tests set memory limits as follows.
>
> alias mvn-test-drill="mvn test \
> -Djunit.args=\"-Duser.timezone=UTC -Duser.language=en
> -Duser.region=US\" \
> -DmemoryMb=2560 -DdirectMemoryMb=2560 \
> -DforkCount=2"
>
> Hope this helps...
> James
>
> On 2023/07/25 00:00, Mike Beckerle wrote:
>
> Hi drill devs,
>
> I'm still stuck on this problem. Can anyone suggest a way past this?
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>
>
>
> On Mon, Jul 17, 2023 at 9:53 AM Mike Beckerle  
>  wrote:
>
>
> Looks like I attached the wrong file. Doing too many things at once.
>
> The correct file is attached here.
>
>
>
> On Fri, Jul 14, 2023 at 2:04 PM Mike Beckerle  
> 
> wrote:
>
>
> Update: I did a clean and install -DskipTests=true.
>
> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>
> But alas, it still fails, this time the failure unique and is only in
> "Java Execution Engine"
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
> (unpack-vector-types) on project drill-java-exec: Artifact has not been
> packaged yet. When used on reactor artifact, unpack should be executed
> after packaging: see MDEP-98. -> [Help 1]
>
> The command and complete trace output are below.
>
> I need assistance on how to proceed.
>
> Complete trace from the mvn test is attached.
>
>
> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle  
> 
> wrote:
>
>
> To answer questions:
>
> 1. Paul: This is a 100% stock build. All I have done is clone the repo
> (master branch). Make a new git branch (in case I make future changes). Try
> to build (success) and test (failed so far).
>
> 2. James: The /opt/drill directory I created is owned by my userid and
> has full read/write access for all the development activities. I just put
> it there so it would have a shorter path to fix the first Hive-related
> glitch I encountered with the Linux 255 limit on file pathname length.
>
> I will try the suggested maven command line for non-UTC and see if
> things improve.
>
> The challenge for me as a newby is how do I know if I have everything
> properly configured?
>
> Can I just turn off building and testing of the Hive-related stuff in
> some supported/well-known way?
>
> If so, I would suggest I'd like to turn off not just Hive, but *as much
> as possible*. I really just need the embedded drill to work.
>
> I would agree with @Charles Givre that a contrib
> package addition is the ideal approach, and that's what I'll be attempting.
>
> -mikeb
>
> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  
>  wrote:
>
>
> I'll add some heresy here... IMHO, for the purposes of developing a
> DFDL extension, you probably don't need all the Drill tests to run.  For
> your project, my suggestion would be to add a module to the contrib package
> and that way your changes are relatively self contained.
> Best,
> -- C
>
>
>
>
> On Jul 13, 2023, at 10:27 AM, James Turton  
>  wrote:
>
> Hi Mike
>
> Here's the command line I use to run tests on a machine that's not in
>
> the UTC time zone (plus some unrelated memory size arguments).
>
> mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
>
> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>
> I have one other question to add to Paul's comments - does the OS
>
> user that you're running Maven under have write access to all of the source
> tree that you put at /opt/drill?
>
> On 2023/07/11 22:12, Paul Rogers wrote:
>
> Hi Mike,
>
> A quick glance at the log suggests a failure in the tests for the
>
> JSON
>
> reader, in the Mongo extended types. Drill's date/time support has
> historically been fragile. Some tests only work if your machine is
>
&g

Drill SQL questions - JSON context

2023-08-18 Thread Mike Beckerle
I'm using Apache Daffodil in the mode where it outputs JSON data. (For the
moment, until we build a tighter integration. This is my conceptual test
framework for that integration.)

I have parsed data to create this JSON which represents 2-level nested
repeating subrecords.

All the simple fields are int.

[{"a":1,  "b":2,  "c":[{"d":3,  "e":4,  "f":[{"g":5,  "h":6 },
 {"g":7,  "h":8 }]},
   {"d":9,  "e":10, "f":[{"g":11, "h":12},
 {"g":13, "h":14}]}]},
 {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 },
 {"g":27, "h":28 }]},
   {"d":29, "e":30, "f":[{"g":31, "h":32},
 {"g":33, "h":34}]}]}]

So, the top level is a vector of maps,
within that, field "c" is a vector of maps,
and within "c" is a field f which is a vector of maps.

The reason I created this is I'm trying to understand the arrays and how
they work with Drill SQL.

I'm trying to figure out how to get this rowset of 3 rows from a query, and
I'm stumped.

  a   b   d   e   g   h
( 1,  2,  3,  4,  5,  6)
( 1,  2,  9, 10, 13, 14)
(21, 22, 29, 30, 33, 34)

This is the SQL that is my conceptual framework, but I'm sure it won't work.

SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h
FROM ... the json file...
WHERE g mod 10 == 3 OR g == 5

But I know it's not going to be that easy to get the query to traverse the
vector inside the vector.

From the doc, the FLATTEN operator seems to be needed, but I can't really
figure it out.
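
My current best guess, based on that doc, is a nested-subquery pattern with
one FLATTEN per vector level, and my pseudo-WHERE rewritten with MOD
(an untested sketch; file name hypothetical):

  SELECT t2.a, t2.b, t2.c.d AS d, t2.c.e AS e, t2.f.g AS g, t2.f.h AS h
  FROM (
    SELECT t1.a, t1.b, t1.c, FLATTEN(t1.c.f) AS f
    FROM (
      SELECT t0.a, t0.b, FLATTEN(t0.c) AS c
      FROM dfs.`nested.json` t0
    ) t1
  ) t2
  WHERE MOD(t2.f.g, 10) = 3 OR t2.f.g = 5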

This is what all my data is like. Trees of nested vectors of sub-records.

Can anyone advise on what the SQL might look like, or where there's an
example doing something like this I can learn from?

Thanks for any help

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


is there a way to provide inline array metadata to inform the xml_reader?

2023-08-14 Thread Mike Beckerle
I'm trying to get my Drill SQL queries to produce the right thing from XML.

A major thing that you can't easily infer from looking at just XML data is
what is an array. XML lacks an array starting indicator.

Is there an inline schema notation in the Drill Query language for
array-ness, so that one can inform Drill what is an array?

For example this provides simple types for all the fields directly in the
query.

@Test
public void testSimpleProvidedSchema() throws Exception {
  String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml` (type => 'xml', schema " +
      "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field` FLOAT, `double_field` DOUBLE, `boolean_field` " +
      "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field` TIMESTAMP, `string_field`" +
      " VARCHAR, `date2_field` DATE properties {`drill.format` = `MM/dd/`})'))";
  RowSet results = client.queryBuilder().sql(sql).rowSet();
  assertEquals(2, results.rowCount());


Can one also tell Drill what fields or child elements are arrays?
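
What I'm hoping exists is something along these lines (hypothetical syntax;
I don't know whether the inline schema grammar accepts ARRAY for XML):

  SELECT * FROM table(cp.`xml/repeated.xml` (type => 'xml', schema =>
    'inline=(`item` ARRAY<VARCHAR>)'))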


checkstyle problems

2023-08-08 Thread Mike Beckerle
I'm getting a crash on checkstyle. I can't figure out what the remaining 2
checkstyle errors are, and checkstyle is failing to create its output
report.

Does anybody have a clue for me on how to proceed?

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-checkstyle-plugin:3.1.1:checkstyle
(default-cli) on project drill-format-xml: An error has occurred in
Checkstyle report generation.: Failed during checkstyle execution: There
are 2 errors reported by Checkstyle 10.7.0 with
src/main/resources/checkstyle-config.xml ruleset. -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.1:checkstyle
(default-cli) on project drill-format-xml: An error has occurred in
Checkstyle report generation.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute
(MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute
(MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute
(MojoExecutor.java:148)


Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


UserBitShared.proto question

2023-08-08 Thread Mike Beckerle
So is UserBitShared.java generated from UserBitShared.proto ?

It looks like it is, but mvn clean install -DskipTests=true doesn't seem to
cause it to be regenerated.

What do I do to cause the regeneration?

Right now I've edited both files to add a new ErrorType.SCHEMA, but I think
I should only have to edit in one spot.


Drill representation of XML Complex Type with Simple Content

2023-08-04 Thread Mike Beckerle
Consider this XML:

<root>
  <int1 x="2">A</int1>
  <int1 x="7">B</int1>
  <int1 y="3">Y</int1>
  <char1 y="4">C</char1>
</root>

And this drill query:

SELECT * FROM cp.`xml/foo.xml`

I am using dataLevel = 1.

The results I get (calling RowSet results.print() in my junit test) are:

#: `attributes` STRUCT<`int1_x` VARCHAR, `int1_y` VARCHAR, `char1_y`
VARCHAR>, `int1` VARCHAR, `char1` VARCHAR
0: {"27", "3", "4"}, "ABY", "C"

So questions:

First, why is it constructing 1 row, not multiple?

The only way I expect to get only 1 row out is if I did a group-by with the
whole row-set having only 1 key value.

Second, why is it concatenating the value strings?

I'd expect to write like: "SELECT '1' AS key, * FROM ...theTable... GROUP
BY key", and only then would I expect concatenation if everything is a
string and concat is somehow the default grouping operation. Even then it's
a stretch.

Here's what I expected to get out after inspecting the schema that was
inferred from the data:

0: {"2", null, null}, "A", null
1: {"7", null, null}, "B", null
2: {null, "3", null}, "Y", null
3: {null, null, "4"}, null, "C"

Those correspond to the 3 columns "attributes", "int1", "char1", where
attributes is itself { int1_x, int1_y, char1_y}.

Third, how would I change my query to get out what I expect?

Lastly, what is the rationale for the name "int1_x" (also int1_y and
char1_y)?
I expected to see two separate attributes columns: "attributes_int1" and
"attributes_char1" as maps with non-prefixed children named  x, y and y
respectively.

I guess I just don't grok the rationale for how queries work against XML.

The natural XML schema for this XML document is:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="int1">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute name="x" type="xs:string"/>
                <xs:attribute name="y" type="xs:string"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
        <xs:element name="char1">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute name="y" type="xs:string"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

I need to synthesize the same TupleMetadata from this schema that the
current XML reader infers incrementally, so I really need to understand the
rationale, because I wouldn't expect this choice to be entirely flattened
including the attributes.

Thanks for any help

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Attending Community over Code 2023 ?

2023-08-04 Thread Mike Beckerle
Is anyone from the drill project planning to attend the Community over Code
2023 conference?




Drill+Daffodil ... Fwd: Talk not accepted for Community over Code 2023

2023-08-03 Thread Mike Beckerle
Well that's unfortunate, but I still want to do the integration.

...mike beckerle


-- Forwarded message -
From: 
Date: Thu, Aug 3, 2023 at 2:18 PM
Subject: Talk not accepted for Community over Code 2023
To: 



Mike Beckerle

Unfortunately, your talk
Direct Query of Arbitrary Data Formats using Apache Drill and Apache
Daffodil
has not been selected for Community over Code 2023.

We always receive more talks than we have room for, and the selection is
always difficult. We hope you will consider submitting a talk again next
year.

Rich, for the event planners


xml_reader branch - and getting Drill to use the XSD-derived metadata

2023-07-31 Thread Mike Beckerle
I added a first cut at attribute support -
https://github.com/cgivre/drill/pull/6

Ok, so given that the xsd_reader can now map a small, usable subset of XSD
to Drill metadata, it seems the next step is getting Drill to actually use
this metadata.

I am not sure where to start here. Where would the metadata from the XSD
plug into the query planning/building?

Can we schedule a time to discuss?

I am broadly available the rest of this week: Tues at 1pm or 3pm, or any
other time, but sooner is better, as I currently have time to work on
this.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Re: drill tests not passing

2023-07-31 Thread Mike Beckerle
Charles,

I fixed your testComplexXSD test in your xsd_reader branch.

https://github.com/cgivre/drill/pull/5

Need to add attribute support, but this is quite close.




On Fri, Jul 14, 2023 at 5:59 PM Charles Givre  wrote:

> Hi Mike,
> One more thing... I've been working on an XSD Reader for Drill for some
> time.  (This is still very buggy)
> https://github.com/cgivre/drill/tree/xsd_reader
>
>  What this does is attempt to convert a XML XSD file into a Drill Schema.
> Best,
> -- C
>
>
>
> On Jul 14, 2023, at 2:20 PM, Charles Givre  wrote:
>
> Mike,
> Are you able to build Drill w/o the tests?  If so, my suggestion is really
> just to start working on the DFDL extensions.  I've been doing Drill stuff
> for far too long and really haven't needed to run the full battery of unit
> tests locally.  As long as you can build it and can execute individual unit
> tests, you should be ok.  Others may disagree, but for what you're doing,
> I'd think it would be fine.
> Best,
> -- C
>
>
>
> On Jul 14, 2023, at 2:04 PM, Mike Beckerle  wrote:
>
> Update: I did a clean and install -DskipTests=true.
>
> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>
> But alas, it still fails, this time the failure unique and is only in
> "Java Execution Engine"
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
> (unpack-vector-types) on project drill-java-exec: Artifact has not been
> packaged yet. When used on reactor artifact, unpack should be executed
> after packaging: see MDEP-98. -> [Help 1]
>
> The command and complete trace output are below.
>
> I need assistance on how to proceed.
>
> Complete trace from the mvn test is attached.
>
>
> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle 
> wrote:
>
>> To answer questions:
>>
>> 1. Paul: This is a 100% stock build. All I have done is clone the repo
>> (master branch). Make a new git branch (in case I make future changes). Try
>> to build (success) and test (failed so far).
>>
>> 2. James: The /opt/drill directory I created is owned by my userid and
>> has full read/write access for all the development activities. I just put
>> it there so it would have a shorter path to fix the first Hive-related
>> glitch I encountered with the Linux 255 limit on file pathname length.
>>
>> I will try the suggested maven command line for non-UTC and see if things
>> improve.
>>
>> The challenge for me as a newby is how do I know if I have everything
>> properly configured?
>>
>> Can I just turn off building and testing of the Hive-related stuff in
>> some supported/well-known way?
>>
>> If so, I would suggest I'd like to turn off not just Hive, but *as much
>> as possible*. I really just need the embedded drill to work.
>>
>> I would agree with @Charles Givre   that a contrib
>> package addition is the ideal approach and that's what I'll be attempting.
>>
>> -mikeb
>>
>> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  wrote:
>>
>>> I'll add some heresy here... IMHO, for the purposes of developing a DFDL
>>> extension, you probably don't need all the Drill tests to run.  For your
>>> project, my suggestion would be to add a module to the contrib package and
>>> that way your changes are relatively self contained.
>>> Best,
>>> -- C
>>>
>>>
>>>
>>> > On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
>>> >
>>> > Hi Mike
>>> >
>>> > Here's the command line I use to run tests on a machine that's not in
>>> the UTC time zone (plus some unrelated memory size arguments).
>>> >
>>> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
>>> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>>> >
>>> > I have one other question to add to Paul's comments - does the OS user
>>> that you're running Maven under have write access to all of the source tree
>>> that you put at /opt/drill?
>>> >
>>> > On 2023/07/11 22:12, Paul Rogers wrote:
>>> >> Hi Mike,
>>> >>
>>> >> A quick glance at the log suggests a failure in the tests for the JSON
>>> >> reader, in the Mongo extended types. Drill's date/time support has
>>> >> historically been fragile. Some tests only work if your machine is
>>> set to
>>> >> use the UTC time zone (or Java is told to pretend that the time is
>>> UTC.)
>>> >> The Mongo types test fail

Re: drill tests not passing

2023-07-25 Thread Mike Beckerle
Figured out my own issue: --add-opens is a Java 9+ option. I'll switch
from Java 8 to Java 11.

On Mon, Jul 24, 2023 at 7:07 PM Mike Beckerle  wrote:

> Charles,
>
> When you say this is close to working, what is the expected behavior of
> the code currently?
> I could debug into this, but I'm frankly unable to get anything to run.
>
> Currently, when I try to run just one test, TestXSDSchema:testSimpleXSD(),
> I get the giant message below, which, to cut to the chase, ends in:
>
> Unrecognized option: --add-opens
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
>
> Process finished with exit code 1
>
>
> ---
>
> /home/mbeckerle/installed-software/jdk1.8.0_361/bin/java -ea
> -Djava.io.tmpdir=/opt/drill/contrib/format-xml/target -Xms512m -Xmx2500m
> -Ddrill.exec.http.enabled=false
> -Ddrill.exec.memory.enable_unsafe_bounds_check=true
> -Ddrill.exec.sys.store.provider.local.write=false
> -Dorg.apache.drill.exec.server.Drillbit.system_options=org.apache.drill.exec.compile.ClassTransformer.scalar_replacement=on
> -Ddrill.catastrophic_to_standard_out=true -XX:MaxDirectMemorySize=4500M
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -ea --add-opens
> java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.net=ALL-UNNAMED
> --add-opens java.base/java.nio=ALL-UNNAMED --add-opens
> java.base/java.util=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED
> --add-opens java.security.jgss/sun.security.krb5=ALL-UNNAMED
> -Djdk.attach.allowAttachSelf=true
> -javaagent:/home/mbeckerle/.m2/repository/org/jmockit/jmockit/1.47/jmockit-1.47.jar
> -Didea.test.cyclic.buffer.size=1048576
> -javaagent:/home/mbeckerle/installed-software/idea-IU-231.8770.65/lib/idea_rt.jar=46065:/home/mbeckerle/installed-software/idea-IU-231.8770.65/bin
> -Dfile.encoding=UTF-8 -classpath
> /home/mbeckerle/installed-software/idea-IU-231.8770.65/lib/idea_rt.jar:/home/mbeckerle/installed-software/idea-IU-231.8770.65/plugins/junit/lib/junit5-rt.jar:/home/mbeckerle/installed-software/idea-IU-231.8770.65/plugins/junit/lib/junit-rt.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/charsets.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/deploy.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/cldrdata.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/dnsns.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/jaccess.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/jfxrt.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/localedata.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/nashorn.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/sunec.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/sunjce_provider.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/sunpkcs11.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/ext/zipfs.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/javaws.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/jce.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/jfr.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/jfxswt.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/jsse.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/management-agent.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/plugin.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/resources.jar:/home/mbeckerle/installed-software/jdk1.8.0_361/jre/lib/rt.jar:/opt/drill/contrib/format-xml/target/test-classes:/opt/drill/contrib/format-xml/target/classes:/opt/drill/exec/java-exec/target/classes:/home/mbeckerle/.m2/repository/org/apache/httpcomponents/httpasyncclient/4.1.4/httpasyncclient-4.1.4.jar:/home/mbeckerle/.m2/repository/org/apache/httpcomponents/httpcore/4.4.10/httpcore-4.4.10.jar:/home/mbeckerle/.m2/repository/org/apache/httpcomponents/httpcore-nio/4.4.10/httpcore-nio-4.4.10.jar:/home/mbeckerle/.m2/repository/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar:/home/mbeckerle/.m2/repository/org/owasp/encoder/encoder/1.2.3/encoder-1.2.3.jar:/home/mbeckerle/.m2/repository/org/ow2/asm/asm-commons/9.2/asm-commons-9.2.jar:/home/mbeckerle/.m2/repository/org/ow2/asm/asm/9.2/asm-9.2.jar:/home/mbeckerle/.m2/repository/org/ow2/asm/asm-tree/9.2/asm-tree-9.2.jar:/home/mbeckerle/.m2/repository/org/ow2/asm/asm-analysis/9.2/asm-analysis-9.2.jar:/home/mbeckerle/.m2/repository/org/ow2/asm/asm-util/9.2/asm-util-9.2.jar:/home/mbeckerle/.m2/repository/com/dropbox/core/dropbox-core-sdk/5.4.4/dropbox-core-sdk-5.4.4.jar:/home/mbeckerle/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.14.3/jackson-core-2.14.3.jar:/home/mbeckerle/.m2/repository/com/box/box-java-sdk/3

Re: drill tests not passing

2023-07-24 Thread Mike Beckerle
:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.logback.converter-classic/8.3.0/de.huxhorn.lilith.logback.converter-classic-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.data.converter/8.3.0/de.huxhorn.lilith.data.converter-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.logback.classic/8.3.0/de.huxhorn.lilith.logback.classic-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.logback.appender.multiplex-core/8.3.0/de.huxhorn.lilith.logback.appender.multiplex-core-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/sulky/de.huxhorn.sulky.ulid/8.3.0/de.huxhorn.sulky.ulid-8.3.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-client/1.0.0/kerb-client-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-config/1.0.0/kerby-config-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-common/1.0.0/kerb-common-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-crypto/1.0.0/kerb-crypto-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-util/1.0.0/kerb-util-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-core/1.0.0/kerb-core-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-pkix/1.0.0/kerby-pkix-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-asn1/1.0.0/kerby-asn1-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-util/1.0.0/kerby-util-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-simplekdc/1.0.0/kerb-simplekdc-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-admin/1.0.0/kerb-admin-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-server/1.0.0/kerb-server-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-identity/1.0.0/kerb-identity-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-xdr/1.0.0/kerby-xdr-1.0.0.jar:exec/jdbc/src/test/resources/storage-plugins.json
com.intellij.rt.junit.JUnitStarter -ideVersion5 -junit4
org.apache.drill.exec.store.xml.xsd.TestXSDSchema,testSimpleXSD
Unrecognized option: --add-opens
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Process finished with exit code 1

On Mon, Jul 17, 2023 at 9:08 AM Mike Beckerle  wrote:

> That looks like a great start at what we will need for DFDL.
>
> I will study what you've done carefully and get back to you with
> questions.
>
>
> On Fri, Jul 14, 2023 at 5:59 PM Charles Givre  wrote:
>
>> Hi Mike,
>> One more thing... I've been working on an XSD Reader for Drill for some
>> time.  (This is still very buggy)
>> https://github.com/cgivre/drill/tree/xsd_reader
>>
>>  What this does is attempt to convert a XML XSD file into a Drill Schema.
>>
>> Best,
>> -- C
>>
>>
>>
>> On Jul 14, 2023, at 2:20 PM, Charles Givre  wrote:
>>
>> Mike,
>> Are you able to build Drill w/o the tests?  If so, my suggestion is
>> really just to start working on the DFDL extensions.  I've been doing Drill
>> stuff for far too long and really haven't needed to run the full battery of
>> unit tests locally.  As long as you can build it and can execute individual
>> unit tests, you should be ok.  Others may disagree, but for what you're
>> doing, I'd think it would be fine.
>> Best,
>> -- C
>>
>>
>>
>> On Jul 14, 2023, at 2:04 PM, Mike Beckerle  wrote:
>>
>> Update: I did a clean and install -DskipTests=true.
>>
>> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>>
>> But alas, it still fails, this time the failure unique and is only in
>> "Java Execution Engine"
>>
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
>> (unpack-vector-types) on project drill-java-exec: Artifact has not been
>> packaged yet. When used on reactor artifact, unpack should be executed
>> after packaging: see MDEP-98. -> [Help 1]
>>
>> The command and complete trace output are below.
>>
>> I need assistance on how to proceed.
>>
>> Complete trace from the mvn test is attached.
>>
>>
>> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle 
>> wrote:
>>
>>> To answer questions:
>>>
>>> 1. Paul: This is a 100% stock build. All I have done is clone the repo
>>> (master branch). Make a new git branch (in case I make future changes). Try
>>> to build (success) and test (failed so far).
>>>
>>> 2. James: The /opt/drill directory I created is owned by my userid and
>>> has full read/write access for all the development activities. I just put
>>> it there so it would have a shorter path to fix the fir

Re: drill tests not passing

2023-07-24 Thread Mike Beckerle
Hi drill devs,

I'm still stuck on this problem. Can anyone suggest a way past this?

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com



On Mon, Jul 17, 2023 at 9:53 AM Mike Beckerle  wrote:

> Looks like I attached the wrong file. Doing too many things at once.
>
> The correct file is attached here.
>
>
>
> On Fri, Jul 14, 2023 at 2:04 PM Mike Beckerle 
> wrote:
>
>> Update: I did a clean and install -DskipTests=true.
>>
>> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>>
>> But alas, it still fails, this time the failure unique and is only in
>> "Java Execution Engine"
>>
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
>> (unpack-vector-types) on project drill-java-exec: Artifact has not been
>> packaged yet. When used on reactor artifact, unpack should be executed
>> after packaging: see MDEP-98. -> [Help 1]
>>
>> The command and complete trace output are below.
>>
>> I need assistance on how to proceed.
>>
>> Complete trace from the mvn test is attached.
>>
>>
>> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle 
>> wrote:
>>
>>> To answer questions:
>>>
>>> 1. Paul: This is a 100% stock build. All I have done is clone the repo
>>> (master branch). Make a new git branch (in case I make future changes). Try
>>> to build (success) and test (failed so far).
>>>
>>> 2. James: The /opt/drill directory I created is owned by my userid and
>>> has full read/write access for all the development activities. I just put
>>> it there so it would have a shorter path to fix the first Hive-related
>>> glitch I encountered with the Linux 255 limit on file pathname length.
>>>
>>> I will try the suggested maven command line for non-UTC and see if
>>> things improve.
>>>
>>> The challenge for me as a newby is how do I know if I have everything
>>> properly configured?
>>>
>>> Can I just turn off building and testing of the Hive-related stuff in
>>> some supported/well-known way?
>>>
>>> If so, I would suggest I'd like to turn off not just Hive, but *as much
>>> as possible*. I really just need the embedded drill to work.
>>>
>>> I would agree with @Charles Givre   that a contrib
>>> package addition is the ideal approach and that's what I'll be attempting.
>>>
>>> -mikeb
>>>
>>> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  wrote:
>>>
>>>> I'll add some heresy here... IMHO, for the purposes of developing a
>>>> DFDL extension, you probably don't need all the Drill tests to run.  For
>>>> your project, my suggestion would be to add a module to the contrib package
>>>> and that way your changes are relatively self contained.
>>>> Best,
>>>> -- C
>>>>
>>>>
>>>>
>>>> > On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
>>>> >
>>>> > Hi Mike
>>>> >
>>>> > Here's the command line I use to run tests on a machine that's not in
>>>> the UTC time zone (plus some unrelated memory size arguments).
>>>> >
>>>> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
>>>> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>>>> >
>>>> > I have one other question to add to Paul's comments - does the OS
>>>> user that you're running Maven under have write access to all of the source
>>>> tree that you put at /opt/drill?
>>>> >
>>>> > On 2023/07/11 22:12, Paul Rogers wrote:
>>>> >> Hi Mike,
>>>> >>
>>>> >> A quick glance at the log suggests a failure in the tests for the
>>>> JSON
>>>> >> reader, in the Mongo extended types. Drill's date/time support has
>>>> >> historically been fragile. Some tests only work if your machine is
>>>> set to
>>>> >> use the UTC time zone (or Java is told to pretend that the time is
>>>> UTC.)
>>>> >> The Mongo types test failure seems to be around a date/time test so
>>>> maybe
>>>> >> this is the issue?
>>>> >>
>>>> >> There are also failures indicating that the Drillbit (Drill server)
>>>> d

Re: drill tests not passing

2023-07-17 Thread Mike Beckerle
That looks like a great start at what we will need for DFDL.

I will study what you've done carefully and get back to you with questions.


On Fri, Jul 14, 2023 at 5:59 PM Charles Givre  wrote:

> Hi Mike,
> One more thing... I've been working on an XSD Reader for Drill for some
> time.  (This is still very buggy)
> https://github.com/cgivre/drill/tree/xsd_reader
>
>  What this does is attempt to convert a XML XSD file into a Drill Schema.
> Best,
> -- C
>
>
>
> On Jul 14, 2023, at 2:20 PM, Charles Givre  wrote:
>
> Mike,
> Are you able to build Drill w/o the tests?  If so, my suggestion is really
> just to start working on the DFDL extensions.  I've been doing Drill stuff
> for far too long and really haven't needed to run the full battery of unit
> tests locally.  As long as you can build it and can execute individual unit
> tests, you should be ok.  Others may disagree, but for what you're doing,
> I'd think it would be fine.
> Best,
> -- C
>
>
>
> On Jul 14, 2023, at 2:04 PM, Mike Beckerle  wrote:
>
> Update: I did a clean and install -DskipTests=true.
>
> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>
> But alas, it still fails, this time the failure unique and is only in
> "Java Execution Engine"
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
> (unpack-vector-types) on project drill-java-exec: Artifact has not been
> packaged yet. When used on reactor artifact, unpack should be executed
> after packaging: see MDEP-98. -> [Help 1]
>
> The command and complete trace output are below.
>
> I need assistance on how to proceed.
>
> Complete trace from the mvn test is attached.
>
>
> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle 
> wrote:
>
>> To answer questions:
>>
>> 1. Paul: This is a 100% stock build. All I have done is clone the repo
>> (master branch). Make a new git branch (in case I make future changes). Try
>> to build (success) and test (failed so far).
>>
>> 2. James: The /opt/drill directory I created is owned by my userid and
>> has full read/write access for all the development activities. I just put
>> it there so it would have a shorter path to fix the first Hive-related
>> glitch I encountered with the Linux 255 limit on file pathname length.
>>
>> I will try the suggested maven command line for non-UTC and see if things
>> improve.
>>
>> The challenge for me as a newby is how do I know if I have everything
>> properly configured?
>>
>> Can I just turn off building and testing of the Hive-related stuff in
>> some supported/well-known way?
>>
>> If so, I would suggest I'd like to turn off not just Hive, but *as much
>> as possible*. I really just need the embedded drill to work.
>>
>> I would agree with @Charles Givre   that a contrib
>> package addition is the ideal approach and that's what I'll be attempting.
>>
>> -mikeb
>>
>> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  wrote:
>>
>>> I'll add some heresy here... IMHO, for the purposes of developing a DFDL
>>> extension, you probably don't need all the Drill tests to run.  For your
>>> project, my suggestion would be to add a module to the contrib package and
>>> that way your changes are relatively self contained.
>>> Best,
>>> -- C
>>>
>>>
>>>
>>> > On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
>>> >
>>> > Hi Mike
>>> >
>>> > Here's the command line I use to run tests on a machine that's not in
>>> the UTC time zone (plus some unrelated memory size arguments).
>>> >
>>> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
>>> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>>> >
>>> > I have one other question to add to Paul's comments - does the OS user
>>> that you're running Maven under have write access to all of the source tree
>>> that you put at /opt/drill?
>>> >
>>> > On 2023/07/11 22:12, Paul Rogers wrote:
>>> >> Hi Mike,
>>> >>
>>> >> A quick glance at the log suggests a failure in the tests for the JSON
>>> >> reader, in the Mongo extended types. Drill's date/time support has
>>> >> historically been fragile. Some tests only work if your machine is
>>> set to
>>> >> use the UTC time zone (or Java is told to pretend that the time is
>>> UTC.)
>>> >> The Mongo types test fail

Re: drill tests not passing

2023-07-14 Thread Mike Beckerle
Update: I did a clean and install -DskipTests=true.

Then I tried the mvn test using the non-UTC timezone stuff, as suggested.

But alas, it still fails, this time the failure unique and is only in "Java
Execution Engine"

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
(unpack-vector-types) on project drill-java-exec: Artifact has not been
packaged yet. When used on reactor artifact, unpack should be executed
after packaging: see MDEP-98. -> [Help 1]

The command and complete trace output are below.

I need assistance on how to proceed.

Complete trace from the mvn test is attached.


On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle  wrote:

> To answer questions:
>
> 1. Paul: This is a 100% stock build. All I have done is clone the repo
> (master branch). Make a new git branch (in case I make future changes). Try
> to build (success) and test (failed so far).
>
> 2. James: The /opt/drill directory I created is owned by my userid and has
> full read/write access for all the development activities. I just put it
> there so it would have a shorter path to fix the first Hive-related glitch
> I encountered with the Linux 255 limit on file pathname length.
>
> I will try the suggested maven command line for non-UTC and see if things
> improve.
>
> The challenge for me as a newby is how do I know if I have everything
> properly configured?
>
> Can I just turn off building and testing of the Hive-related stuff in some
> supported/well-known way?
>
> If so, I would suggest I'd like to turn off not just Hive, but *as much
> as possible*. I really just need the embedded drill to work.
>
> I would agree with @Charles Givre   that a contrib
> package addition is the ideal approach and that's what I'll be attempting.
>
> -mikeb
>
> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  wrote:
>
>> I'll add some heresy here... IMHO, for the purposes of developing a DFDL
>> extension, you probably don't need all the Drill tests to run.  For your
>> project, my suggestion would be to add a module to the contrib package and
>> that way your changes are relatively self contained.
>> Best,
>> -- C
>>
>>
>>
>> > On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
>> >
>> > Hi Mike
>> >
>> > Here's the command line I use to run tests on a machine that's not in
>> the UTC time zone (plus some unrelated memory size arguments).
>> >
>> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
>> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>> >
>> > I have one other question to add to Paul's comments - does the OS user
>> that you're running Maven under have write access to all of the source tree
>> that you put at /opt/drill?
>> >
>> > On 2023/07/11 22:12, Paul Rogers wrote:
>> >> Hi Mike,
>> >>
>> >> A quick glance at the log suggests a failure in the tests for the JSON
>> >> reader, in the Mongo extended types. Drill's date/time support has
>> >> historically been fragile. Some tests only work if your machine is set
>> to
>> >> use the UTC time zone (or Java is told to pretend that the time is
>> UTC.)
>> >> The Mongo types test failure seems to be around a date/time test so
>> maybe
>> >> this is the issue?
>> >>
>> >> There are also failures indicating that the Drillbit (Drill server)
>> died.
>> >> Not sure how this can happen, as tests run Drill embedded (or used to.)
>> >> Looking earlier in the logs, it seems that the Drillbit didn't start
>> due to
>> >> UDF (user-defined function) failures:
>> >>
>> >> Found duplicated function in drill-custom-lower.jar:
>> >> custom_lower(VARCHAR-REQUIRED)
>> >> Found duplicated function in built-in: lower(VARCHAR-REQUIRED)
>> >>
>> >> Not sure how this could occur: it should have failed in all builds.
>> >>
>> >> Also:
>> >>
>> >> File
>> >>
>> /opt/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/home/drill/happy/udf/staging/drill-custom-lower-sources.jar
>> >> does not exist on file system file:///
>> >>
>> >> This is complaining that Drill needs the source code (not just class
>> file)
>> >> for its built-in functions. Again, this should not fail in a standard
>> >> build, because if it did, it would fail in all builds.
>> >>
>> >> There are other odd errors as well.
>> >>
>> &g

Re: drill tests not passing

2023-07-13 Thread Mike Beckerle
To answer questions:

1. Paul: This is a 100% stock build. All I have done is clone the repo
(master branch). Make a new git branch (in case I make future changes). Try
to build (success) and test (failed so far).

2. James: The /opt/drill directory I created is owned by my userid and has
full read/write access for all the development activities. I just put it
there so it would have a shorter path to fix the first Hive-related glitch
I encountered with the Linux 255 limit on file pathname length.

I will try the suggested maven command line for non-UTC and see if things
improve.

The challenge for me as a newby is how do I know if I have everything
properly configured?

Can I just turn off building and testing of the Hive-related stuff in some
supported/well-known way?

If so, I would suggest I'd like to turn off not just Hive, but *as much as
possible*. I really just need the embedded drill to work.

I would agree with @Charles Givre   that a contrib
package addition is the ideal approach and that's what I'll be attempting.

-mikeb

On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  wrote:

> I'll add some heresy here... IMHO, for the purposes of developing a DFDL
> extension, you probably don't need all the Drill tests to run.  For your
> project, my suggestion would be to add a module to the contrib package and
> that way your changes are relatively self contained.
> Best,
> -- C
>
>
>
> > On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
> >
> > Hi Mike
> >
> > Here's the command line I use to run tests on a machine that's not in
> the UTC time zone (plus some unrelated memory size arguments).
> >
> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
> >
> > I have one other question to add to Paul's comments - does the OS user
> that you're running Maven under have write access to all of the source tree
> that you put at /opt/drill?
> >
> > On 2023/07/11 22:12, Paul Rogers wrote:
> >> Hi Mike,
> >>
> >> A quick glance at the log suggests a failure in the tests for the JSON
> >> reader, in the Mongo extended types. Drill's date/time support has
> >> historically been fragile. Some tests only work if your machine is set
> to
> >> use the UTC time zone (or Java is told to pretend that the time is UTC.)
> >> The Mongo types test failure seems to be around a date/time test so
> maybe
> >> this is the issue?
> >>
> >> There are also failures indicating that the Drillbit (Drill server)
> died.
> >> Not sure how this can happen, as tests run Drill embedded (or used to.)
> >> Looking earlier in the logs, it seems that the Drillbit didn't start
> due to
> >> UDF (user-defined function) failures:
> >>
> >> Found duplicated function in drill-custom-lower.jar:
> >> custom_lower(VARCHAR-REQUIRED)
> >> Found duplicated function in built-in: lower(VARCHAR-REQUIRED)
> >>
> >> Not sure how this could occur: it should have failed in all builds.
> >>
> >> Also:
> >>
> >> File
> >>
> /opt/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/home/drill/happy/udf/staging/drill-custom-lower-sources.jar
> >> does not exist on file system file:///
> >>
> >> This is complaining that Drill needs the source code (not just class
> file)
> >> for its built-in functions. Again, this should not fail in a standard
> >> build, because if it did, it would fail in all builds.
> >>
> >> There are other odd errors as well.
> >>
> >> Perhaps we should ask: is this a "stock" build? Check out Drill and run
> >> tests? Or, have you already started making changes for your project?
> >>
> >> - Paul
> >>
> >>
> >> On Tue, Jul 11, 2023 at 9:07 AM Mike Beckerle 
> wrote:
> >>
> >>> I have drill building and running its tests. Some tests fail: [ERROR]
> >>> Tests run: 4366, Failures: 2, Errors: 1, Skipped: 133
> >>>
> >>> I am wondering if there is perhaps some setup step that I missed in the
> >>> instructions.
> >>>
> >>> I have attached the output from the 'mvn clean install
> -DskipTests=false'
> >>> execution. (zipped)
> >>> I am running on Ubuntu 20.04, definitely have Java 8 setup.
> >>>
> >>> I'm hoping someone can skim it and spot the issue(s).
> >>>
> >>> Thanks for any help
> >>>
> >>> Mike Beckerle
> >>> Apache Daffodil PMC | daffodil.apache.org
> >>> OGF DFDL Workgroup Co-Chair |
> www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> >>> Owl Cyber Defense | www.owlcyberdefense.com
> >>>
> >>>
> >>>
> >
>
>


Drill and Highly Hierarchical Data from Daffodil

2023-07-11 Thread Mike Beckerle
In designing the integration of Apache Daffodil into Drill, I'm trying to
figure out how queries would look operating on deeply nested data.

Here's an example.

This is the path to many geo-location latLong field pairs in some
"messageSet" data:

messageSet/noc_message[*]/message_content/content/vmf/payload/message/K05_17/overlay_message/r1_group/item[*]/points_group/item[*]/latLong

This is sort-of like XPath, except in the above I have put "[*]" to
indicate the child elements that are vectors. You can see there are 3
nested vectors here.

Beneath that path are these two fields, which are what I would want out of
my query, along with some fields from higher up in the nest.

entity_latitude_1/degrees
entity_longitude_1/degrees

The tutorial information here

https://drill.apache.org/docs/selecting-nested-data-for-a-column/

describes how to index into JSON arrays with specific integer values, but I
don't want specific integers; I want all of the values.

Can someone show me what a hypothetical Drill query would look like that
pulls out all the values of this latLong pair?

My stab is:

SELECT pairs.entity_latitude_1.degrees AS lat,
pairs.entity_longitude_1.degrees AS lon FROM
 
messageSet.noc_message[*].message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item[*].points_group.item[*].latLong
AS pairs

I'm not at all sure about the vectors in that though.

The other idea was this quasi-notation (that I'm making up on the fly here)
which treats each vector as a table.

SELECT pairs.entity_latitude_1.degrees AS lat,
pairs.entity_longitude_1.degrees AS lon FROM
  messageSet.noc_message AS messages,

messages.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item
AS parents
  parents.points_group.item AS items
  items.latLong AS pairs

I have no idea if that makes any sense at all for Drill.
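
If the FLATTEN pattern from the JSON docs applies here, maybe the workable
version is nested subqueries with one FLATTEN per vector level, something
like this (an untested sketch; file name hypothetical):

  SELECT pts.item.latLong.entity_latitude_1.degrees AS lat,
         pts.item.latLong.entity_longitude_1.degrees AS lon
  FROM (
    SELECT FLATTEN(r.item.points_group.item) AS item
    FROM (
      SELECT FLATTEN(m.msg.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item) AS item
      FROM (
        SELECT FLATTEN(t.messageSet.noc_message) AS msg
        FROM dfs.`messageSet.json` t
      ) m
    ) r
  ) pts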

Any help greatly appreciated.

-Mike Beckerle


Re: Newby: First attempt to build drill - failure

2023-07-11 Thread Mike Beckerle
Should there be a ticket created about this:

/home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore$drop_partition_by_name_with_environment_context_args$drop_partition_by_name_with_environment_context_argsTupleSchemeFactory.class

The longest part of that path is the file name itself, which has
"drop_partition_by_name_with_environment_context_args" appearing twice in
the class file name. This appears to be a generated name, so we should be
able to shorten it.


On Tue, Jul 11, 2023 at 12:27 AM James Turton  wrote:

> Good news and welcome to Drill!
>
> I haven't heard of anyone running into this problem before, and I build
> Drill under the directory /home/james/Development/apache/drill which
> isn't far off of what you tried in terms of length. I do see the
> 280-character path cited by Maven below though. Perhaps in your case the
> drill-hive-exec-shaded was downloaded from the Apache Snapshots repo,
> rather than built locally, and this issue only presents itself if the
> maven-dependency-plugin must unpack a very long file path from a
> downloaded jar.
>
>
> On 2023/07/10 18:23, Mike Beckerle wrote:
> > Never mind. The file name was > 255 characters long, so I have installed
> > the drill build tree in /opt and now the path is shorter than 255 characters.
> >
> >
> > On Mon, Jul 10, 2023 at 12:00 PM Mike Beckerle 
> wrote:
> >
> >> I'm trying to build the current master branch as of today 2023-07-10.
> >>
> >> It fails due to a file-name too long issue.
> >>
> >> The command I issued is just "mvn clean install -DskipTests" per the
> >> instructions.
> >>
> >> I'm running on Linux, Ubuntu 20.04. Java 8.
> >>
> >> [INFO] --- maven-dependency-plugin:3.4.0:unpack (unpack) @
> >> drill-hive-exec-shaded ---
> >> [INFO] Configured Artifact:
> >>
> org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:1.22.0-SNAPSHOT:jar
> >> [INFO] Unpacking
> >>
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/drill-hive-exec-shaded-1.22.0-SNAPSHOT.jar
> >> to
> >>
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes
> >> with includes "**/**" and excludes ""
> >> [Maven reactor summary snipped; identical to the original message below]

Re: Newby: First attempt to build drill - failure

2023-07-10 Thread Mike Beckerle
Never mind. The file name was > 255 characters long, so I have installed the
drill build tree in /opt and now the path is shorter than 255 characters.


On Mon, Jul 10, 2023 at 12:00 PM Mike Beckerle  wrote:

> I'm trying to build the current master branch as of today 2023-07-10.
>
> It fails due to a file-name too long issue.
>
> The command I issued is just "mvn clean install -DskipTests" per the
> instructions.
>
> I'm running on Linux, Ubuntu 20.04. Java 8.
>
> [INFO] --- maven-dependency-plugin:3.4.0:unpack (unpack) @
> drill-hive-exec-shaded ---
> [INFO] Configured Artifact:
> org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:1.22.0-SNAPSHOT:jar
> [INFO] Unpacking
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/drill-hive-exec-shaded-1.22.0-SNAPSHOT.jar
> to
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes
> with includes "**/**" and excludes ""
> [Maven reactor summary snipped; identical to the original message below]

Newby: First attempt to build drill - failure

2023-07-10 Thread Mike Beckerle
I'm trying to build the current master branch as of today 2023-07-10.

It fails due to a file-name too long issue.

The command I issued is just "mvn clean install -DskipTests" per the
instructions.

I'm running on Linux, Ubuntu 20.04. Java 8.

[INFO] --- maven-dependency-plugin:3.4.0:unpack (unpack) @
drill-hive-exec-shaded ---
[INFO] Configured Artifact:
org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:1.22.0-SNAPSHOT:jar
[INFO] Unpacking
/home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/drill-hive-exec-shaded-1.22.0-SNAPSHOT.jar
to
/home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes
with includes "**/**" and excludes ""
[INFO]

[INFO] Reactor Summary for Drill : 1.22.0-SNAPSHOT:
[INFO]
[INFO] Drill :  SUCCESS [
 3.974 s]
[INFO] Drill : Tools :  SUCCESS [
 0.226 s]
[INFO] Drill : Tools : Freemarker codegen . SUCCESS [
 3.762 s]
[INFO] Drill : Protocol ... SUCCESS [
 5.001 s]
[INFO] Drill : Common . SUCCESS [
 4.944 s]
[INFO] Drill : Logical Plan ... SUCCESS [
 5.991 s]
[INFO] Drill : Exec : . SUCCESS [
 0.210 s]
[INFO] Drill : Exec : Memory :  SUCCESS [
 0.179 s]
[INFO] Drill : Exec : Memory : Base ... SUCCESS [
 2.373 s]
[INFO] Drill : Exec : RPC . SUCCESS [
 2.436 s]
[INFO] Drill : Exec : Vectors . SUCCESS [
54.917 s]
[INFO] Drill : Contrib : .. SUCCESS [
 0.138 s]
[INFO] Drill : Contrib : Data : ... SUCCESS [
 0.143 s]
[INFO] Drill : Contrib : Data : TPCH Sample ... SUCCESS [
 1.473 s]
[INFO] Drill : Metastore :  SUCCESS [
 0.144 s]
[INFO] Drill : Metastore : API  SUCCESS [
 4.366 s]
[INFO] Drill : Metastore : Iceberg  SUCCESS [
 3.940 s]
[INFO] Drill : Exec : Java Execution Engine ... SUCCESS [01:04
min]
[INFO] Drill : Exec : JDBC Driver using dependencies .. SUCCESS [
 7.332 s]
[INFO] Drill : Exec : JDBC JAR with all dependencies .. SUCCESS [
16.304 s]
[INFO] Drill : On-YARN  SUCCESS [
 5.477 s]
[INFO] Drill : Metastore : RDBMS .. SUCCESS [
 6.704 s]
[INFO] Drill : Metastore : Mongo .. SUCCESS [
 3.621 s]
[INFO] Drill : Contrib : Storage : Kudu ... SUCCESS [
 6.693 s]
[INFO] Drill : Contrib : Format : XML . SUCCESS [
 3.511 s]
[INFO] Drill : Contrib : Storage : HTTP ... SUCCESS [
 5.195 s]
[INFO] Drill : Contrib : Storage : OpenTSDB ... SUCCESS [
 3.561 s]
[INFO] Drill : Contrib : Storage : MongoDB  SUCCESS [
 4.850 s]
[INFO] Drill : Contrib : Storage : HBase .. SUCCESS [
10.857 s]
[INFO] Drill : Contrib : Storage : JDBC ... SUCCESS [
 4.413 s]
[INFO] Drill : Contrib : Storage : Hive : . SUCCESS [
 0.128 s]
[INFO] Drill : Contrib : Storage : Hive : Exec Shaded . FAILURE [
19.135 s]
[INFO] Drill : Contrib : Storage : Hive : Core  SKIPPED
[INFO] Drill : Contrib : Storage : Kafka .. SKIPPED
[INFO] Drill : Contrib : Storage : Cassandra .. SKIPPED
[INFO] Drill : Contrib : Storage : ElasticSearch .. SKIPPED
[INFO] Drill : Contrib : Storage : Splunk . SKIPPED
[INFO] Drill : Contrib : Storage : GoogleSheets ... SKIPPED
[INFO] Drill : Contrib : Storage : Phoenix  SKIPPED
[INFO] Drill : Contrib : UDFs . SKIPPED
[INFO] Drill : Contrib : Format : Syslog .. SKIPPED
[INFO] Drill : Contrib : Format : Httpd/Nginx Access Log .. SKIPPED
[INFO] Drill : Contrib : Format : PDF . SKIPPED
[INFO] Drill : Contrib : Format : HDF5  SKIPPED
[INFO] Drill : Contrib : Format : SPSS  SKIPPED
[INFO] Drill : Contrib : Format : SAS . SKIPPED
[INFO] Drill : Contrib : Format : LTSV  SKIPPED
[INFO] Drill : Contrib : Format : Image ... SKIPPED
[INFO] Drill : Contrib : Format : Pcap-NG . SKIPPED
[INFO] Drill : Contrib : Format : Esri  SKIPPED
[INFO] Drill : Contrib : Format : Excel ... SKIPPED
[INFO] Drill : Contrib : Format : MS Access ... SKIPPED
[INFO] Drill : Contrib : Format : Log Regex ... SKIPPED
[INFO] Drill : Contrib : Storage : Druid .. SKIPPED
[INFO] Drill : Contrib : Format : Iceberg . SKIPPED
[INFO] Drill : 

A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

2023-07-06 Thread Mike Beckerle
I decided the only way to force getting this Drill + Daffodil integration
done, or at least started, is to have a deadline.

So I submitted this abstract below for the upcoming "Community over Code"
(formerly known as ApacheCon) conference this fall (Oct 7-10)

I'm hoping this forces some of the refactoring that is gating other efforts
and fixes in Daffodil at the same time.

Direct Query of Arbitrary Data Formats using Apache Drill and Apache
Daffodil

Suppose you have data in an ad-hoc data format like EDIFACT, ISO8583,
Asterix, some COBOL FD, or any other kind of data. You can now describe
it with a Data Format Description Language (DFDL) schema; then, using
Apache Drill, you can directly query that data, and those queries can also
incorporate data from any of Apache Drill's other array of data sources.
This talk will describe the integration of Apache Drill with Apache
Daffodil's DFDL implementation. This deep integration implements Drill's
metadata model in terms of the Daffodil DFDL metadata model, and implements
Drill's data model in terms of the Daffodil DFDL Infoset API. This enables
Drill queries to operate intelligently on DFDL-described data without the
cost of data conversion into an expensive intermediate form like JSON or
XML. The talk will highlight the specific challenges in this integration
and the lessons learned that are applicable to integration of other Apache
projects having their own metadata and data models.


Are any Drill Devs attending ApacheCon in NOLA? Hack Drill + Daffodil ?

2022-08-16 Thread Mike Beckerle
I am wondering if some of the Apache Drill devs are going to be at
ApacheCon in October.

I am hoping to do some hacking of Drill + Apache Daffodil to see how
hard/easy an integration would be.

The notion that, given a DFDL schema, you can immediately query the data is
super attractive, and our simple Daffodil command-line tool would be greatly
enhanced by a Drill-based query capability.

Will anyone also be at ApacheCon to help me hack this?

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com