Re: [PR] WIP: Preliminary Review on adding Daffodil to Drill (drill)

2023-10-18 Thread via GitHub


cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1364798604


##
contrib/pom.xml:
##
@@ -59,6 +59,7 @@
     <module>format-pcapng</module>
     <module>format-iceberg</module>
     <module>format-deltalake</module>
+    <module>format-daffodil</module>

Review Comment:
   Please keep these in alphabetical order.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] WIP: Preliminary Review on adding Daffodil to Drill (drill)

2023-10-18 Thread via GitHub


cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1364797899


##
contrib/format-daffodil/src/test/java/org/apache/drill/exec/store/daffodil/TestDaffodilReader.java:
##
@@ -0,0 +1,652 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.daffodil;
+
+import org.apache.drill.categories.RowSetTest;
+import org.apache.drill.common.types.TypeProtos.DataMode;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.physical.rowSet.RowSet;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterTest;
+import org.apache.drill.test.rowSet.RowSetComparison;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+import java.nio.file.Paths;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.LocalTime;
+
+import static org.apache.drill.test.QueryTestUtil.generateCompressedFile;
+import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray;
+import static org.apache.drill.test.rowSet.RowSetUtilities.objArray;
+import static org.apache.drill.test.rowSet.RowSetUtilities.strArray;
+import static org.junit.Assert.assertEquals;
+
+@Category(RowSetTest.class)
+public class TestDaffodilReader extends ClusterTest {
+
+  @BeforeClass
+  public static void setup() throws Exception {
+    ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher));
+
+    DaffodilFormatConfig formatConfig = new DaffodilFormatConfig(null, "", "", false);
+
+    // FIXME: What do these things do? specify xml extension file names are somehow significant?
+    cluster.defineFormat("cp", "daffodil", formatConfig);
+    cluster.defineFormat("dfs", "daffodil", formatConfig);
+
+    // FIXME: Do we need this?

Review Comment:
   You'll need this if you want to run tests with a compressed file.  Drill
should be able to read compressed files, so it may be worth throwing a test in
with a zipped file or something.
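   As an illustration, a minimal sketch of such a test, assuming the
`generateCompressedFile` helper this test class already imports; the file
names and the expected row count are placeholders, not anything from the PR:

       @Test
       public void testZippedFile() throws Exception {
         // Compress a copy of an existing test file into the dfs workspace.
         // Both paths here are hypothetical examples.
         generateCompressedFile("data/complexArray1.dat", "zip",
             "data/complexArray1.dat.zip");

         // Drill should decompress transparently based on the .zip extension.
         String sql = "SELECT * FROM dfs.`data/complexArray1.dat.zip`";
         RowSet results = client.queryBuilder().sql(sql).rowSet();

         // Placeholder assertion: expect the same rows as the uncompressed test.
         assertEquals(1, results.rowCount());
         results.clear();
       }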





Re: [PR] WIP: Preliminary Review on adding Daffodil to Drill (drill)

2023-10-18 Thread via GitHub


cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1364798270


##
distribution/src/assemble/component.xml:
##


Review Comment:
   Please keep these in alphabetical order.





Re: [PR] WIP: Preliminary Review on adding Daffodil to Drill (drill)

2023-10-18 Thread via GitHub


cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1364797516


##
contrib/format-daffodil/src/test/java/org/apache/drill/exec/store/daffodil/TestDaffodilReader.java:
##
@@ -0,0 +1,652 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.daffodil;
+
+import org.apache.drill.categories.RowSetTest;
+import org.apache.drill.common.types.TypeProtos.DataMode;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.physical.rowSet.RowSet;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.ClusterTest;
+import org.apache.drill.test.rowSet.RowSetComparison;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+import java.nio.file.Paths;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.LocalTime;
+
+import static org.apache.drill.test.QueryTestUtil.generateCompressedFile;
+import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray;
+import static org.apache.drill.test.rowSet.RowSetUtilities.objArray;
+import static org.apache.drill.test.rowSet.RowSetUtilities.strArray;
+import static org.junit.Assert.assertEquals;
+
+@Category(RowSetTest.class)
+public class TestDaffodilReader extends ClusterTest {
+
+  @BeforeClass
+  public static void setup() throws Exception {
+    ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher));
+
+    DaffodilFormatConfig formatConfig = new DaffodilFormatConfig(null, "", "", false);
+
+    // FIXME: What do these things do? specify xml extension file names are somehow significant?
+    cluster.defineFormat("cp", "daffodil", formatConfig);

Review Comment:
   These are equivalent to the default plugins you get in Drill when you first 
install it.  For your tests, you really only need one.  I'd just go with `dfs`, 
but it doesn't really matter.





Re: [PR] WIP: Preliminary Review on adding Daffodil to Drill (drill)

2023-10-18 Thread via GitHub


cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1364795241


##
contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.daffodil;
+
+import java.io.InputStream;
+import java.net.URI;
+import java.net.URISyntaxException;
+
+import org.apache.daffodil.japi.DataProcessor;
+import org.apache.drill.common.AutoCloseables;
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.impl.scan.v3.ManagedReader;
+import org.apache.drill.exec.physical.impl.scan.v3.file.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.daffodil.schema.DaffodilDataProcessorFactory;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.drill.exec.store.daffodil.schema.DrillDaffodilSchemaUtils.daffodilDataProcessorToDrillSchema;
+
+
+public class DaffodilBatchReader implements ManagedReader {
+
+  private static final Logger logger = LoggerFactory.getLogger(DaffodilBatchReader.class);
+  private final DaffodilFormatConfig formatConfig;
+  private final RowSetLoader rowSetLoader;
+  private final CustomErrorContext errorContext;
+  private final DaffodilMessageParser dafParser;
+  private final boolean validationMode;
+
+  private final InputStream dataInputStream;
+
+  static class DaffodilReaderConfig {
+    final DaffodilFormatPlugin plugin;
+    DaffodilReaderConfig(DaffodilFormatPlugin plugin) {
+      this.plugin = plugin;
+    }
+  }
+
+  public DaffodilBatchReader(DaffodilReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
+
+    errorContext = negotiator.parentErrorContext();
+    this.formatConfig = readerConfig.plugin.getConfig();
+
+    this.validationMode = formatConfig.getValidationMode();
+
+    //
+    // FIXME: Next, a MIRACLE occurs.
+    //
+    // We get the dfdlSchemaURI filled in from the query, or a default config location
+    // We get the rootName (or null if not supplied) from the query, or a default config location
+    // We get the rootNamespace (or null if not supplied) from the query, or a default config location
+    // We get the validationMode (true/false) filled in from the query or a default config location
+    // We get the dataInputURI filled in from the query, or from a default config location
+    //
+    // For a first cut, let's just fake it. :-)
+
+    String rootName = null;
+    String rootNamespace = null;
+
+    URI dfdlSchemaURI;
+    URI dataInputURI;
+
+    try {

Review Comment:
   A few things...
   1. I added config variables for the `rootName` and `rootNamespace` to the format config. This means that you can set default values in the config or override them in the query.
   2. It looks to me like we should do the same for the schema URI as well.

   I think the object you're looking for here to access the file system would be the `negotiator.file().fileSystem()` object. With that object you can access the file system directly, either via `Path` or `URI`. Take a peek at some of the methods available to you there.

   As an example, in the SHP file reader, we do something similar:

   https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java#L77-L83
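   For concreteness, a sketch of what the constructor could do with that object in place of the hard-coded URIs. It assumes the EVF v2 `FileDescrip`/`DrillFileSystem` API used by `ShpBatchReader`, and `getRootName()`/`getRootNamespace()` are assumed getter names for the new config variables:

       // Sketch only: open the data file through Drill's file system
       // rather than through a hand-built URI.
       FileDescrip file = negotiator.file();
       DrillFileSystem fs = file.fileSystem();
       Path dataPath = file.split().getPath();

       // openPossiblyCompressedStream() transparently handles .gz and friends.
       InputStream dataInputStream = fs.openPossiblyCompressedStream(dataPath);

       // Defaults come from the format config; a query can override them.
       String rootName = formatConfig.getRootName();
       String rootNamespace = formatConfig.getRootNamespace();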
   
   
   



##
contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); 

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Paul Rogers
Hi Mike,

Earlier on, there were two approaches discussed:

1. Using a Daffodil schema to map to a Drill schema, and use Drill's
existing schema mechanisms for all of Drill's existing input formats.
2. Using a Daffodil-specific reader so that Daffodil does the data parsing.

Some of my earlier answers assumed you were doing option 1. The code shows
you are doing option 2. There are pros and cons, but let's just focus on
option 2 for now.

You need a way for a reader (running on Drillbit 2) to get a schema from a
query (planned on Drillbit 1). How does the Daffodil schema get from Node 1
to Node 2? Charles suggested ZK; I suggested that is not such a great idea,
for a number of reasons. A more "Drill-like" way would be to include the
Daffodil schema in the query plan: either as JSON or as a binary blob. The
planner attaches the schema when creating the reader definition; the reader
deserializes the schema at run time.
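A hedged sketch of the schema-in-the-plan idea (every name below is
hypothetical; Drill serializes plan fragments with Jackson, so a field on
the reader's scan definition rides along in the JSON plan automatically):

  import com.fasterxml.jackson.annotation.JsonCreator;
  import com.fasterxml.jackson.annotation.JsonProperty;

  // Hypothetical fragment of a Daffodil scan spec / reader definition.
  public class DaffodilScanSpec {
    // Compiled DFDL schema as a binary blob; Jackson Base64-encodes
    // byte[] fields when writing the JSON query plan.
    private final byte[] compiledDfdlSchema;

    @JsonCreator
    public DaffodilScanSpec(
        @JsonProperty("compiledDfdlSchema") byte[] compiledDfdlSchema) {
      this.compiledDfdlSchema = compiledDfdlSchema;
    }

    public byte[] getCompiledDfdlSchema() {
      return compiledDfdlSchema;
    }
  }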

I believe you said schemas can be large. So, you could instead serialize a
reference. To do that, you'd need a location visible to all Drill nodes:
HDFS, S3, web server, etc. A crude-but-effective approach to get started is
the one mentioned for Drill's own metadata: the schema must reside in the
same directory as the data. This opens up issues with update race
conditions, as noted earlier. But, it could work if you are "careful." If
there is a Daffodil schema server, that would be better.

Given all that, your DaffodilBatchReader is generally headed in the right
direction. The same is true of DaffodilDrillInfosetOutputter, though, for
performance, you'll want to cache the column readers rather than do
name-based lookups for every column for every row. (Drill is designed to
read billions of rows; that's a lot of lookups!) But, that can be optimized
once things work.
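For illustration, a minimal sketch of that caching, assuming Drill's EVF
RowSetLoader/ScalarWriter accessors (the class and its all-VARCHAR rows are
illustrative, not the actual DaffodilDrillInfosetOutputter code):

  import org.apache.drill.exec.physical.resultSet.RowSetLoader;
  import org.apache.drill.exec.record.metadata.TupleMetadata;
  import org.apache.drill.exec.vector.accessor.ScalarWriter;

  // Sketch: resolve each column's writer once, up front, instead of a
  // name-based lookup for every column of every row.
  class CachedWriters {
    private final RowSetLoader rowWriter;
    private final ScalarWriter[] writers;

    CachedWriters(RowSetLoader rowWriter, TupleMetadata schema) {
      this.rowWriter = rowWriter;
      this.writers = new ScalarWriter[schema.size()];
      for (int i = 0; i < schema.size(); i++) {
        writers[i] = rowWriter.scalar(schema.metadata(i).name());
      }
    }

    void writeRow(String[] values) {
      rowWriter.start();
      for (int i = 0; i < writers.length; i++) {
        writers[i].setString(values[i]);
      }
      rowWriter.save();
    }
  }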

You'll soon be at a place where you'll want to do some debugging. The
S-L-O-W way is to build Drill, fire off a query, and sort out what went
wrong, perhaps attaching a debugger. Another slow way is to fire up a
Drillbit in your test and run a query. (Such a test is a great integration
test, however.)

A good way to debug is to create a test that includes just your reader and
surrounding plumbing. This way, you can set up very specific cases and
easily debug, in a single thread, right from your IDE. The JSON reader
tests may have some examples. Charles may have others.

Thanks,

- Paul

On Wed, Oct 18, 2023 at 4:06 PM Charles Givre  wrote:

> Got it.  I’ll review today and tomorrow and hopefully we can get you
> unblocked.
> Sent from my iPhone
>
> > On Oct 18, 2023, at 18:01, Mike Beckerle  wrote:
> >
> > I am very much hoping someone will look at my open PR soon.
> > https://github.com/apache/drill/pull/2836
> >
> > I am basically blocked on this effort until you help me with one key area
> > of that.
> >
> > I expect the part I am puzzling over is routine to you, so it will save
> > me much effort.
> >
> > This is the key area in the DaffodilBatchReader.java code:
> >
> >  // FIXME: Next, a MIRACLE occurs.
> >  //
> >  // We get the dfdlSchemaURI filled in from the query, or a default
> > config location
> >  // We get the rootName (or null if not supplied) from the query, or a
> > default config location
> >  // We get the rootNamespace (or null if not supplied) from the query, or
> > a default config location
> >  // We get the validationMode (true/false) filled in from the query or a
> > default config location
> >  // We get the dataInputURI filled in from the query, or from a default
> > config location
> >  //
> >  // For a first cut, let's just fake it. :-)
> >  boolean validationMode = true;
> >  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
> >  String rootName = null;
> >  String rootNamespace = null;
> >  URI dataInputURI = new URI("data/complexArray1.dat");
> >
> >
> > I imagine this is just a few lines of code to grab these from the query,
> > and I don't even care about config files for now.
> >
> > I gave up on trying to figure out how to do this myself. It was actually
> > quite unclear from looking at the other format plugins. The way Drill does
> > configuration is obviously motivated by the distributed architecture
> > combined with pluggability, but all that combined with the negotiation over
> > schemas which extends into runtime, and it all became quite muddy to me. I
> > think what I need is super straightforward, so I figured I should just
> > ask.
> >
> > This is just to get enough working (against local files only) that I can be
> > unblocked on creating and testing the rest of the Daffodil-to-Drill
> > metadata bridge and data bridge.
> >
> > My plan is to get all kinds of data and queries working first but just
> > against local-only files.  Fixing it to work in distributed Drill can come
> > later.
> >
> > -mikeb
> >
> >> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers  wrote:
> >>
> >> Hi Charles,
> >>
> >> The persistent store is just ZooKeeper, and ZK is known to work 

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Charles Givre
Got it.  I’ll review today and tomorrow and hopefully we can get you unblocked. 
 
Sent from my iPhone

> On Oct 18, 2023, at 18:01, Mike Beckerle  wrote:
> 
> I am very much hoping someone will look at my open PR soon.
> https://github.com/apache/drill/pull/2836
> 
> I am basically blocked on this effort until you help me with one key area
> of that.
> 
> I expect the part I am puzzling over is routine to you, so it will save me
> much effort.
> 
> This is the key area in the DaffodilBatchReader.java code:
> 
>  // FIXME: Next, a MIRACLE occurs.
>  //
>  // We get the dfdlSchemaURI filled in from the query, or a default config
> location
>  // We get the rootName (or null if not supplied) from the query, or a
> default config location
>  // We get the rootNamespace (or null if not supplied) from the query, or
> a default config location
>  // We get the validationMode (true/false) filled in from the query or a
> default config location
>  // We get the dataInputURI filled in from the query, or from a default
> config location
>  //
>  // For a first cut, let's just fake it. :-)
>  boolean validationMode = true;
>  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
>  String rootName = null;
>  String rootNamespace = null;
>  URI dataInputURI = new URI("data/complexArray1.dat");
> 
> 
> I imagine this is just a few lines of code to grab these from the query,
> and I don't even care about config files for now.
> 
> I gave up on trying to figure out how to do this myself. It was actually
> quite unclear from looking at the other format plugins. The way Drill does
> configuration is obviously motivated by the distributed architecture
> combined with pluggability, but all that combined with the negotiation over
> schemas which extends into runtime, and it all became quite muddy to me. I
> think what I need is super straightforward, so I figured I should just
> ask.
> 
> This is just to get enough working (against local files only) that I can be
> unblocked on creating and testing the rest of the Daffodil-to-Drill
> metadata bridge and data bridge.
> 
> My plan is to get all kinds of data and queries working first but just
> against local-only files.  Fixing it to work in distributed Drill can come
> later.
> 
> -mikeb
> 
>> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers  wrote:
>> 
>> Hi Charles,
>> 
>> The persistent store is just ZooKeeper, and ZK is known to work poorly as
>> a distributed DB. ZK works great for things like tokens, node registrations
>> and the like. But, ZK scales very poorly for things like schemas (or query
>> profiles or a list of active queries.)
>> 
>> A more scalable approach may be to cache the schemas in each Drillbit,
>> then translate them to Drill's format and include them in each Scan
>> operator definition sent to each execution Drillbit. That solution avoids
>> race conditions when the schemas change while a query is in flight. This
>> is, in fact, the model used for storage plugin definitions. (The storage
>> plugin definitions are, in fact, stored in ZK, but tend to be small and few
>> in number.)
>> 
>> - Paul
>> 
>> 
>>> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre  wrote:
>>> 
>>> Hi Mike,
>>> I hope all is well.  I remembered one other piece which might be useful
>>> for you.  Drill has an interface called a PersistentStore which is used for
>>> storing artifacts such as tokens etc.  I've used it on two occasions: in
>>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
>>> store OAuth user tokens which need to be preserved and shared across
>>> drillbits, and also frequently updated.  I was thinking that this might be
>>> useful for caching the DFDL schemata.  If you take a look here:
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>>> 
>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
>>> and here
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
>>> you can see how I used that.
>>> 
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
 On Oct 13, 2023, at 1:25 PM, Mike Beckerle  wrote:
 
 Very helpful.
 
 Answers to your questions, and comments are below:
 
 On Thu, Oct 12, 2023 at 5:14 PM Charles Givre  wrote:
> Hi Mike,
> I hope all is well.  I'll take a stab at answering your questions.
> But I have a few questions as well:
> 
> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
> was that this would be a format plugin, but let me know if you were
> thinking differently
 
 Format plugin.
 
> 2.  In traditional deployments, where do people store the DFDL
> schemata files?  Are they local or accessible via URL?
 
 Schemas are stored in files, or in jar files created 

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Mike Beckerle
I am very much hoping someone will look at my open PR soon.
https://github.com/apache/drill/pull/2836

I am basically blocked on this effort until you help me with one key area
of that.

I expect the part I am puzzling over is routine to you, so it will save me
much effort.

This is the key area in the DaffodilBatchReader.java code:

  // FIXME: Next, a MIRACLE occurs.
  //
  // We get the dfdlSchemaURI filled in from the query, or a default config
location
  // We get the rootName (or null if not supplied) from the query, or a
default config location
  // We get the rootNamespace (or null if not supplied) from the query, or
a default config location
  // We get the validationMode (true/false) filled in from the query or a
default config location
  // We get the dataInputURI filled in from the query, or from a default
config location
  //
  // For a first cut, let's just fake it. :-)
  boolean validationMode = true;
  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
  String rootName = null;
  String rootNamespace = null;
  URI dataInputURI = new URI("data/complexArray1.dat");


I imagine this is just a few lines of code to grab these from the query,
and I don't even care about config files for now.

I gave up on trying to figure out how to do this myself. It was actually
quite unclear from looking at the other format plugins. The way Drill does
configuration is obviously motivated by the distributed architecture
combined with pluggability, but all that combined with the negotiation over
schemas which extends into runtime, and it all became quite muddy to me. I
think what I need is super straightforward, so I figured I should just
ask.

This is just to get enough working (against local files only) that I can be
unblocked on creating and testing the rest of the Daffodil-to-Drill
metadata bridge and data bridge.

My plan is to get all kinds of data and queries working first but just
against local-only files.  Fixing it to work in distributed Drill can come
later.

-mikeb

On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers  wrote:

> Hi Charles,
>
> The persistent store is just ZooKeeper, and ZK is known to work poorly as
> a distributed DB. ZK works great for things like tokens, node registrations
> and the like. But, ZK scales very poorly for things like schemas (or query
> profiles or a list of active queries.)
>
> A more scalable approach may be to cache the schemas in each Drillbit,
> then translate them to Drill's format and include them in each Scan
> operator definition sent to each execution Drillbit. That solution avoids
> race conditions when the schemas change while a query is in flight. This
> is, in fact, the model used for storage plugin definitions. (The storage
> plugin definitions are, in fact, stored in ZK, but tend to be small and few
> in number.)
>
> - Paul
>
>
> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre  wrote:
>
>> Hi Mike,
>> I hope all is well.  I remembered one other piece which might be useful
>> for you.  Drill has an interface called a PersistentStore which is used for
>> storing artifacts such as tokens etc.  I've used it on two occasions: in
>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
>> store OAuth user tokens which need to be preserved and shared across
>> drillbits, and also frequently updated.  I was thinking that this might be
>> useful for caching the DFDL schemata.  If you take a look here:
>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>>
>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
>> and here
>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
>> you can see how I used that.
>>
>> Best,
>> -- C
>>
>>
>>
>>
>>
>>
>> > On Oct 13, 2023, at 1:25 PM, Mike Beckerle  wrote:
>> >
>> > Very helpful.
>> >
>> > Answers to your questions, and comments are below:
>> >
>> > On Thu, Oct 12, 2023 at 5:14 PM Charles Givre  wrote:
>> >> Hi Mike,
>> >> I hope all is well.  I'll take a stab at answering your questions.
>> >> But I have a few questions as well:
>> >>
>> >> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
>> >> was that this would be a format plugin, but let me know if you were
>> >> thinking differently
>> >
>> > Format plugin.
>> >
>> >> 2.  In traditional deployments, where do people store the DFDL
>> >> schemata files?  Are they local or accessible via URL?
>> >
>> > Schemas are stored in files, or in jar files created when packaging a
>> > schema project. Hence URI is the preferred identifier for them.  They are
>> > not retrieved remotely or anything like that. It's a matter of whether they
>> > are in jars on the classpath, directories on the classpath, or just a file
>> > location.
>> >
>> > The source-code of DFDL schemas are often created using 

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Paul Rogers
Hi Charles,

The persistent store is just ZooKeeper, and ZK is known to work poorly as a
distributed DB. ZK works great for things like tokens, node registrations
and the like. But, ZK scales very poorly for things like schemas (or query
profiles or a list of active queries.)

A more scalable approach may be to cache the schemas in each Drillbit, then
translate them to Drill's format and include them in each Scan operator
definition sent to each execution Drillbit. That solution avoids race
conditions when the schemas change while a query is in flight. This is, in
fact, the model used for storage plugin definitions. (The storage plugin
definitions are, in fact, stored in ZK, but tend to be small and few in
number.)

- Paul


On Wed, Oct 18, 2023 at 7:51 AM Charles Givre  wrote:

> Hi Mike,
> I hope all is well.  I remembered one other piece which might be useful
> for you.  Drill has an interface called a PersistentStore which is used for
> storing artifacts such as tokens etc.  I've used it on two occasions: in
> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
> store OAuth user tokens which need to be preserved and shared across
> drillbits, and also frequently updated.  I was thinking that this might be
> useful for caching the DFDL schemata.  If you take a look here:
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
> and here
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
> you can see how I used that.
>
> Best,
> -- C
>
>
>
>
>
>
> > On Oct 13, 2023, at 1:25 PM, Mike Beckerle  wrote:
> >
> > Very helpful.
> >
> > Answers to your questions, and comments are below:
> >
> > On Thu, Oct 12, 2023 at 5:14 PM Charles Givre  wrote:
> >> Hi Mike,
> >> I hope all is well.  I'll take a stab at answering your questions.  But
> >> I have a few questions as well:
> >>
> >> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
> >> was that this would be a format plugin, but let me know if you were
> >> thinking differently
> >
> > Format plugin.
> >
> >> 2.  In traditional deployments, where do people store the DFDL schemata
> >> files?  Are they local or accessible via URL?
> >
> > Schemas are stored in files, or in jar files created when packaging a
> > schema project. Hence URI is the preferred identifier for them.  They are
> > not retrieved remotely or anything like that. It's a matter of whether they
> > are in jars on the classpath, directories on the classpath, or just a file
> > location.
> >
> > The source-code of DFDL schemas are often created using other schemas as
> > components, so a single "DFDL schema" may have parts that come from 5 jar
> > files on the classpath e.g., 2 different header schemas, a library schema,
> > and the "main" schema that assembles them all.  Inside schemas they refer
> > to each other via xs:include or xs:import, and the schemaLocation attribute
> > takes a URI to the location of the included/imported schema and those URIs
> > are interpreted this same way we would want Drill to identify the location
> > of a schema.
> >
> > However, really people will want to pre-compile any real non-toy/test
> > DFDL schemas into binary ".bin" files for faster loading. Otherwise
> > Daffodil schema compilation time can be excessive (minutes for large DFDL
> > schemas - for example the DFDL schema for VMF is 180K lines of DFDL).
> > Compiled schemas live in exactly 1 file (relatively small. The compiled
> > form of VMF schema is 8Mbytes). So the path given for schema in Drill sql
> > query, or in the config wants to be allowed to be either a compiled schema
> > or a source-code schema (.xsd) this latter mostly being for test, training,
> > and toy examples that we would compile on-the-fly.
> >
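For concreteness, a sketch of loading either form through Daffodil's Java
API (the ".bin" extension check is an assumed convention, diagnostics
handling is omitted, and new File(uri) assumes a file: URI):

  import java.io.File;
  import java.net.URI;
  import org.apache.daffodil.japi.Compiler;
  import org.apache.daffodil.japi.Daffodil;
  import org.apache.daffodil.japi.DataProcessor;
  import org.apache.daffodil.japi.ProcessorFactory;

  class DaffodilSchemaLoader {
    // Load a pre-compiled parser, or compile a .dfdl.xsd on the fly.
    static DataProcessor load(URI schemaUri) throws Exception {
      Compiler compiler = Daffodil.compiler();
      if (schemaUri.getPath().endsWith(".bin")) {
        // Fast path: reload a saved, pre-compiled parser file.
        return compiler.reload(new File(schemaUri));
      }
      // Slow path: compile the source schema; this can take minutes
      // for large schemas, as noted above.
      ProcessorFactory pf = compiler.compileSource(schemaUri);
      return pf.onPath("/");
    }
  }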
> >> To get the DFDL schema file or URL we have a few options, all of which
> >> revolve around setting a config variable.  For now, let's just say that the
> >> schema file is contained in the same folder as the data.  (We can make this
> >> more sophisticated later...)
> >
> > It would make life difficult if the schemas and test data must be
> > co-resident. Most schema projects have these in entirely separate
> > sub-trees. Schema will be under src/main/resources/.../xsd, compiled
> > schema would be under target/... and test data under
> > src/test/resources/.../data
> >
> > For now I think the easiest thing is just we get two URIs. One is for
> > the data, one is for the schema. We access them via
> > getClass().getResource().
> >
> > We should not worry about caching or anything for now. Once the above
> > works for a decent scope of tests we can worry about making it more
> > convenient to have a library of schemas at one's disposal.
> >
> >>
> >> Here's what you have to do.
> >>
> >> 1.  In the formatConfig file, 

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Charles Givre
Hi Mike, 
I hope all is well.  I remembered one other piece which might be useful for 
you.  Drill has an interface called a PersistentStore which is used for storing 
artifacts such as tokens etc.  I've used it on two occasions: in the
GoogleSheets plugin and the Http plugin.  In both cases, I used it to store 
OAuth user tokens which need to be preserved and shared across drillbits, and 
also frequently updated.  I was thinking that this might be useful for caching 
the DFDL schemata.  If you take a look here: 
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
 
https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
 and here 
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
 you can see how I used that.

Best,
-- C

  




> On Oct 13, 2023, at 1:25 PM, Mike Beckerle  wrote:
> 
> Very helpful.
> 
> Answers to your questions, and comments are below:
> 
> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre  wrote:
>> Hi Mike,
>> I hope all is well.  I'll take a stab at answering your questions.  But I 
>> have a few questions as well:
>>  
>> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was 
>> that this would be a format plugin, but let me know if you were thinking 
>> differently
> 
> Format plugin.
>  
>> 2.  In traditional deployments, where do people store the DFDL schemata 
>> files?  Are they local or accessible via URL?
> 
> Schemas are stored in files, or in jar files created when packaging a schema 
> project. Hence URI is the preferred identifier for them.  They are not 
> retrieved remotely or anything like that. It's a matter of whether they are 
> in jars on the classpath, directories on the classpath, or just a file 
> location. 
> 
> The source-code of DFDL schemas are often created using other schemas as 
> components, so a single "DFDL schema" may have parts that come from 5 jar 
> files on the classpath e.g., 2 different header schemas, a library schema, 
> and the "main" schema that assembles them all.  Inside schemas they refer to 
> each other via xs:include or xs:import, and the schemaLocation attribute 
> takes a URI to the location of the included/imported schema and those URIs 
> are interpreted this same way we would want Drill to identify the location of 
> a schema. 
> 
> However, really people will want to pre-compile any real non-toy/test DFDL 
> schemas into binary ".bin" files for faster loading. Otherwise Daffodil 
> schema compilation time can be excessive (minutes for large DFDL schemas - 
> for example the DFDL schema for VMF is 180K lines of DFDL). Compiled schemas 
> live in exactly 1 file (relatively small. The compiled form of VMF schema is 
> 8Mbytes). So the path given for schema in Drill sql query, or in the config 
> wants to be allowed to be either a compiled schema or a source-code schema 
> (.xsd) this latter mostly being for test, training, and toy examples that we 
> would compile on-the-fly.  
>  
>> To get the DFDL schema file or URL we have a few options, all of which 
>> revolve around setting a config variable.  For now, let's just say that the 
>> schema file is contained in the same folder as the data.  (We can make this 
>> more sophisticated later...)
> 
> It would make life difficult if the schemas and test data must be 
> co-resident. Most schema projects have these in entirely separate sub-trees. 
> Schema will be under src/main/resources/.../xsd, compiled schema would be
> under target/... and test data under src/test/resources/.../data
> 
> For now I think the easiest thing is just we get two URIs. One is for the 
> data, one is for the schema. We access them via getClass().getResource(). 
> 
> We should not worry about caching or anything for now. Once the above works 
> for a decent scope of tests we can worry about making it more convenient to 
> have a library of schemas at one's disposal. 
>  
>> 
>> Here's what you have to do.
>> 
>> 1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... 
>> config variables must be private and final.  If they aren't it can cause 
>> weird errors that are really difficult to debug.  For some reference, take a 
>> look at the Excel plugin.  
>> (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
>> 
>> Setting a config variable there will allow a user to set a global schema 
>> definition.  This can also be configured individually for various 
>> workspaces.  So let's say you had PCAP files in one workspace, you could 
>> globally set the DFDL file for that and then another workspace which has 
>> some other file, you could create another DFDL plugin instance for that. 
> 
> Ok, so the above lets me play with Drill and one
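
Following that step 1, a minimal sketch of the config class on the
ExcelFormatConfig pattern linked above; the PR's real DaffodilFormatConfig
clearly takes more constructor arguments (the test code earlier passes
four), and equals()/hashCode() are omitted here:

  import com.fasterxml.jackson.annotation.JsonCreator;
  import com.fasterxml.jackson.annotation.JsonProperty;
  import com.fasterxml.jackson.annotation.JsonTypeName;
  import org.apache.drill.common.logical.FormatPluginConfig;

  @JsonTypeName("daffodil")
  public class DaffodilFormatConfig implements FormatPluginConfig {

    // Config variables must be private and final, per the note above.
    private final String dfdlSchema;

    @JsonCreator
    public DaffodilFormatConfig(@JsonProperty("dfdlSchema") String dfdlSchema) {
      this.dfdlSchema = dfdlSchema == null ? "" : dfdlSchema;
    }

    public String getDfdlSchema() {
      return dfdlSchema;
    }
  }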