Re: drill tests not passing

2023-07-11 Thread Paul Rogers
Hi Mike,

A quick glance at the log suggests a failure in the tests for the JSON
reader, in the Mongo extended types. Drill's date/time support has
historically been fragile. Some tests only work if your machine is set to
use the UTC time zone (or Java is told to pretend that the time is UTC).
The Mongo types test failure seems to be around a date/time test so maybe
this is the issue?
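
If the UTC guess is right, one quick check is to rerun the build with the
JVM pinned to UTC. This is only a sketch: `TZ` is the usual Linux mechanism
and `argLine` is the stock Surefire property for passing JVM flags to forked
test JVMs; neither is Drill-specific, and I haven't confirmed which one
these particular tests respect.

```shell
# Pretend the machine is in UTC, both for native code (TZ) and for any
# forked Surefire test JVMs (-Duser.timezone=UTC).
export TZ=UTC
mvn clean install -DskipTests=false -DargLine="-Duser.timezone=UTC"
```

If the Mongo extended-type failures disappear under UTC, that would confirm
the time-zone theory.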

There are also failures indicating that the Drillbit (Drill server) died.
Not sure how this can happen, as tests run Drill embedded (or used to).
Looking earlier in the logs, it seems that the Drillbit didn't start due to
UDF (user-defined function) failures:

Found duplicated function in drill-custom-lower.jar:
custom_lower(VARCHAR-REQUIRED)
Found duplicated function in built-in: lower(VARCHAR-REQUIRED)

Not sure how this could occur: it should have failed in all builds.

Also:

File
/opt/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/home/drill/happy/udf/staging/drill-custom-lower-sources.jar
does not exist on file system file:///

This is complaining that Drill needs the source code (not just class file)
for its built-in functions. Again, this should not fail in a standard
build, because if it did, it would fail in all builds.

There are other odd errors as well.

Perhaps we should ask: is this a "stock" build? Check out Drill and run
tests? Or, have you already started making changes for your project?

- Paul


On Tue, Jul 11, 2023 at 9:07 AM Mike Beckerle  wrote:

>
> I have drill building and running its tests. Some tests fail: [ERROR]
> Tests run: 4366, Failures: 2, Errors: 1, Skipped: 133
>
> I am wondering if there is perhaps some setup step that I missed in the
> instructions.
>
> I have attached the output from the 'mvn clean install -DskipTests=false'
> execution. (zipped)
> I am running on Ubuntu 20.04, and I definitely have Java 8 set up.
>
> I'm hoping someone can skim it and spot the issue(s).
>
> Thanks for any help
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>
>
>


Re: Drill and Highly Hierarchical Data from Daffodil

2023-07-11 Thread Paul Rogers
Drill can internally handle scalars, arrays (AKA vectors) and maps (AKA
tuples, structs). SQL, however, prefers to work with scalars: there is no
good syntax to reach inside a complex object for, say, a WHERE condition
without also projecting that item as a top-level scalar.

The cool thing, for ML use cases, is that Drill's arrays can also be
structured: a vector of input values, each of which is a vector of data
points, along with a class label.

That said, if you have a record with a field "obj" that is a map (struct,
object) that contains a field "coord" that is an array of two (or three)
doubles, you can project it as:

SELECT obj.coord FROM something

The value you get back will be an array. Drill's native API handles this
just fine. JDBC does not really speak "vector". So, in that case, you could
project the elements:

SELECT obj.coord[0] AS x, obj.coord[1] AS y FROM something

I find it helpful to first think about how Drill's internal data vectors
will look, then work from there to the SQL that will do what needs doing.
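
For the three [*] levels in Mike's path, the usual Drill idiom is FLATTEN,
which un-nests exactly one array level per call, so multiple levels stack as
subqueries. The sketch below is untested and makes assumptions: the `dfs`
path is a placeholder, each input row is assumed to be one messageSet
record, and the field and alias names are taken from (or invented around)
Mike's path.

```sql
-- One FLATTEN per [*] level, outermost array first.
SELECT t3.pg_item.latLong.entity_latitude_1.degrees  AS lat,
       t3.pg_item.latLong.entity_longitude_1.degrees AS lon
FROM (
  SELECT FLATTEN(t2.r1_item.points_group.`item`) AS pg_item
  FROM (
    SELECT FLATTEN(t1.msg.message_content.content.vmf.payload
                   .message.K05_17.overlay_message.r1_group.`item`) AS r1_item
    FROM (
      SELECT FLATTEN(m.messageSet.noc_message) AS msg
      FROM dfs.`/path/to/messageSet/data` AS m
    ) t1
  ) t2
) t3;
```

FLATTEN multiplies rows, so fields from higher up in the nest simply ride
along through each subquery as extra projected columns.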

- Paul

On Tue, Jul 11, 2023 at 11:46 AM Charles Givre  wrote:

> Hi Mike,
> When you say "you want all of them", can you clarify a bit about what
> you'd want the data to look like?
> Best,
> -- C
>
>
>
> > On Jul 11, 2023, at 12:33 PM, Mike Beckerle 
> wrote:
> >
> > In designing the integration of Apache Daffodil into Drill, I'm trying to
> > figure out how queries would look operating on deeply nested data.
> >
> > Here's an example.
> >
> > This is the path to many geo-location latLong field pairs in some
> > "messageSet" data:
> >
> >
> messageSet/noc_message[*]/message_content/content/vmf/payload/message/K05_17/overlay_message/r1_group/item[*]/points_group/item[*]/latLong
> >
> > This is sort-of like XPath, except in the above I have put "[*]" to
> > indicate the child elements that are vectors. You can see there are 3
> > nested vectors here.
> >
> > Beneath that path are these two fields, which are what I would want out
> of
> > my query, along with some fields from higher up in the nest.
> >
> > entity_latitude_1/degrees
> > entity_longitude_1/degrees
> >
> > The tutorial information here
> >
> >https://drill.apache.org/docs/selecting-nested-data-for-a-column/
> >
> > describes how to index into JSON arrays with specific integer values,
> but I
> > don't want specific integers, I want all values of them.
> >
> > Can someone show me what a hypothetical Drill query would look like that
> > pulls out all the values of this latLong pair?
> >
> > My stab is:
> >
> > SELECT pairs.entity_latitude_1.degrees AS lat,
> > pairs.entity_longitude_1.degrees AS lon FROM
> >
> messageSet.noc_message[*].message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item[*].points_group.item[*].latLong
> > AS pairs
> >
> > I'm not at all sure about the vectors in that though.
> >
> > The other idea was this quasi-notation (that I'm making up on the fly
> here)
> > which treats each vector as a table.
> >
> > SELECT pairs.entity_latitude_1.degrees AS lat,
> > pairs.entity_longitude_1.degrees AS lon FROM
> >  messageSet.noc_message AS messages,
> >
> >
> messages.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item
> > AS parents
> >  parents.points_group.item AS items
> >  items.latLong AS pairs
> >
> > I have no idea if that makes any sense at all for Drill.
> >
> > Any help greatly appreciated.
> >
> > -Mike Beckerle
>
>


Re: Drill and Highly Hierarchical Data from Daffodil

2023-07-11 Thread Charles Givre
Hi Mike,
When you say "you want all of them", can you clarify a bit about what you'd
want the data to look like?
Best,
-- C



> On Jul 11, 2023, at 12:33 PM, Mike Beckerle  wrote:
> 
> In designing the integration of Apache Daffodil into Drill, I'm trying to
> figure out how queries would look operating on deeply nested data.
> 
> Here's an example.
> 
> This is the path to many geo-location latLong field pairs in some
> "messageSet" data:
> 
> messageSet/noc_message[*]/message_content/content/vmf/payload/message/K05_17/overlay_message/r1_group/item[*]/points_group/item[*]/latLong
> 
> This is sort-of like XPath, except in the above I have put "[*]" to
> indicate the child elements that are vectors. You can see there are 3
> nested vectors here.
> 
> Beneath that path are these two fields, which are what I would want out of
> my query, along with some fields from higher up in the nest.
> 
> entity_latitude_1/degrees
> entity_longitude_1/degrees
> 
> The tutorial information here
> 
>https://drill.apache.org/docs/selecting-nested-data-for-a-column/
> 
> describes how to index into JSON arrays with specific integer values, but I
> don't want specific integers, I want all values of them.
> 
> Can someone show me what a hypothetical Drill query would look like that
> pulls out all the values of this latLong pair?
> 
> My stab is:
> 
> SELECT pairs.entity_latitude_1.degrees AS lat,
> pairs.entity_longitude_1.degrees AS lon FROM
> messageSet.noc_message[*].message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item[*].points_group.item[*].latLong
> AS pairs
> 
> I'm not at all sure about the vectors in that though.
> 
> The other idea was this quasi-notation (that I'm making up on the fly here)
> which treats each vector as a table.
> 
> SELECT pairs.entity_latitude_1.degrees AS lat,
> pairs.entity_longitude_1.degrees AS lon FROM
>  messageSet.noc_message AS messages,
> 
> messages.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item
> AS parents
>  parents.points_group.item AS items
>  items.latLong AS pairs
> 
> I have no idea if that makes any sense at all for Drill.
> 
> Any help greatly appreciated.
> 
> -Mike Beckerle



Drill and Highly Hierarchical Data from Daffodil

2023-07-11 Thread Mike Beckerle
In designing the integration of Apache Daffodil into Drill, I'm trying to
figure out how queries would look operating on deeply nested data.

Here's an example.

This is the path to many geo-location latLong field pairs in some
"messageSet" data:

messageSet/noc_message[*]/message_content/content/vmf/payload/message/K05_17/overlay_message/r1_group/item[*]/points_group/item[*]/latLong

This is sort-of like XPath, except in the above I have put "[*]" to
indicate the child elements that are vectors. You can see there are 3
nested vectors here.

Beneath that path are these two fields, which are what I would want out of
my query, along with some fields from higher up in the nest.

entity_latitude_1/degrees
entity_longitude_1/degrees

The tutorial information here

https://drill.apache.org/docs/selecting-nested-data-for-a-column/

describes how to index into JSON arrays with specific integer values, but I
don't want specific integers, I want all values of them.

Can someone show me what a hypothetical Drill query would look like that
pulls out all the values of this latLong pair?

My stab is:

SELECT pairs.entity_latitude_1.degrees AS lat,
pairs.entity_longitude_1.degrees AS lon FROM
 
messageSet.noc_message[*].message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item[*].points_group.item[*].latLong
AS pairs

I'm not at all sure about the vectors in that though.

The other idea was this quasi-notation (that I'm making up on the fly here)
which treats each vector as a table.

SELECT pairs.entity_latitude_1.degrees AS lat,
pairs.entity_longitude_1.degrees AS lon FROM
  messageSet.noc_message AS messages,

messages.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item
AS parents
  parents.points_group.item AS items
  items.latLong AS pairs

I have no idea if that makes any sense at all for Drill.

Any help greatly appreciated.

-Mike Beckerle


Re: [I] NPE on DeltaRowGroupScan (drill)

2023-07-11 Thread via GitHub


cgivre closed issue #2810: NPE on DeltaRowGroupScan
URL: https://github.com/apache/drill/issues/2810


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: Newby: First attempt to build drill - failure

2023-07-11 Thread Charles Givre
Methinks the Hive plugin could probably use some attention. With that said, I 
don't know how much use it actually gets.  Yes... a ticket would probably be in 
order.
Best,
-- C



> On Jul 11, 2023, at 10:38 AM, Mike Beckerle  wrote:
> 
> Should there be a ticket created about this:
> 
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore$drop_partition_by_name_with_environment_context_args$drop_partition_by_name_with_environment_context_argsTupleSchemeFactory.class
> 
> The largest part of that path is the file name part which has
> "drop_partition_by_name_with_environment_context_args" appearing twice in
> the class file name. This appears to be a generated name so we should be
> able to shorten it.
> 
> 
> On Tue, Jul 11, 2023 at 12:27 AM James Turton  wrote:
> 
>> Good news and welcome to Drill!
>> 
>> I haven't heard of anyone running into this problem before, and I build
>> Drill under the directory /home/james/Development/apache/drill which
>> isn't far off of what you tried in terms of length. I do see the
>> 280-character path cited by Maven below though. Perhaps in your case the
>> drill-hive-exec-shaded was downloaded from the Apache Snapshots repo,
>> rather than built locally, and this issue only presents itself if the
>> maven-dependency-plugin must unpack a very long file path from a
>> downloaded jar.
>> 
>> 
>> On 2023/07/10 18:23, Mike Beckerle wrote:
>>> Never mind. The file name was > 255 characters long, so I have installed
>>> the drill build tree in /opt and now the path is shorter than 255 characters.
>>> 
>>> 
>>> On Mon, Jul 10, 2023 at 12:00 PM Mike Beckerle 
>> wrote:
>>> 
 I'm trying to build the current master branch as of today 2023-07-10.
 
 It fails due to a file-name too long issue.
 
 The command I issued is just "mvn clean install -DskipTests" per the
 instructions.
 
 I'm running on Linux, Ubuntu 20.04. Java 8.
 
 [INFO] --- maven-dependency-plugin:3.4.0:unpack (unpack) @
 drill-hive-exec-shaded ---
 [INFO] Configured Artifact:
 
>> org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:1.22.0-SNAPSHOT:jar
 [INFO] Unpacking
 
>> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/drill-hive-exec-shaded-1.22.0-SNAPSHOT.jar
 to
 
>> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes
 with includes "**/**" and excludes ""
 [INFO]
 
 [INFO] Reactor Summary for Drill : 1.22.0-SNAPSHOT:
 [INFO]
 [INFO] Drill :  SUCCESS [
  3.974 s]
 [INFO] Drill : Tools :  SUCCESS [
  0.226 s]
 [INFO] Drill : Tools : Freemarker codegen . SUCCESS [
  3.762 s]
 [INFO] Drill : Protocol ... SUCCESS [
  5.001 s]
 [INFO] Drill : Common . SUCCESS [
  4.944 s]
 [INFO] Drill : Logical Plan ... SUCCESS [
  5.991 s]
 [INFO] Drill : Exec : . SUCCESS [
  0.210 s]
 [INFO] Drill : Exec : Memory :  SUCCESS [
  0.179 s]
 [INFO] Drill : Exec : Memory : Base ... SUCCESS [
  2.373 s]
 [INFO] Drill : Exec : RPC . SUCCESS [
  2.436 s]
 [INFO] Drill : Exec : Vectors . SUCCESS [
 54.917 s]
 [INFO] Drill : Contrib : .. SUCCESS [
  0.138 s]
 [INFO] Drill : Contrib : Data : ... SUCCESS [
  0.143 s]
 [INFO] Drill : Contrib : Data : TPCH Sample ... SUCCESS [
  1.473 s]
 [INFO] Drill : Metastore :  SUCCESS [
  0.144 s]
 [INFO] Drill : Metastore : API  SUCCESS [
  4.366 s]
 [INFO] Drill : Metastore : Iceberg  SUCCESS [
  3.940 s]
 [INFO] Drill : Exec : Java Execution Engine ... SUCCESS
>> [01:04
 min]
 [INFO] Drill : Exec : JDBC Driver using dependencies .. SUCCESS [
  7.332 s]
 [INFO] Drill : Exec : JDBC JAR with all dependencies .. SUCCESS [
 16.304 s]
 [INFO] Drill : On-YARN  SUCCESS [
  5.477 s]
 [INFO] Drill : Metastore : RDBMS .. SUCCESS [
  6.704 s]
 [INFO] Drill : Metastore : Mongo .. SUCCESS [
  3.621 s]
 [INFO] Drill : Contrib : Storage : Kudu ... SUCCESS [
  6.693 s]
 [INFO] Drill : Contrib : Format : XML . SUCCESS [
  3.511 s]
 [INFO] Drill : Contrib : Storage : HTTP 

Re: Newby: First attempt to build drill - failure

2023-07-11 Thread Mike Beckerle
Should there be a ticket created about this:

/home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore$drop_partition_by_name_with_environment_context_args$drop_partition_by_name_with_environment_context_argsTupleSchemeFactory.class

The largest part of that path is the file name part which has
"drop_partition_by_name_with_environment_context_args" appearing twice in
the class file name. This appears to be a generated name so we should be
able to shorten it.


On Tue, Jul 11, 2023 at 12:27 AM James Turton  wrote:

> Good news and welcome to Drill!
>
> I haven't heard of anyone running into this problem before, and I build
> Drill under the directory /home/james/Development/apache/drill which
> isn't far off of what you tried in terms of length. I do see the
> 280-character path cited by Maven below though. Perhaps in your case the
> drill-hive-exec-shaded was downloaded from the Apache Snapshots repo,
> rather than built locally, and this issue only presents itself if the
> maven-dependency-plugin must unpack a very long file path from a
> downloaded jar.
>
>
> On 2023/07/10 18:23, Mike Beckerle wrote:
> > Never mind. The file name was > 255 characters long, so I have installed
> > the drill build tree in /opt and now the path is shorter than 255 characters.
> >
> >
> > On Mon, Jul 10, 2023 at 12:00 PM Mike Beckerle 
> wrote:
> >
> >> I'm trying to build the current master branch as of today 2023-07-10.
> >>
> >> It fails due to a file-name too long issue.
> >>
> >> The command I issued is just "mvn clean install -DskipTests" per the
> >> instructions.
> >>
> >> I'm running on Linux, Ubuntu 20.04. Java 8.
> >>
> >> [INFO] --- maven-dependency-plugin:3.4.0:unpack (unpack) @
> >> drill-hive-exec-shaded ---
> >> [INFO] Configured Artifact:
> >>
> org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:1.22.0-SNAPSHOT:jar
> >> [INFO] Unpacking
> >>
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/drill-hive-exec-shaded-1.22.0-SNAPSHOT.jar
> >> to
> >>
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes
> >> with includes "**/**" and excludes ""
> >> [INFO]
> >> 
> >> [INFO] Reactor Summary for Drill : 1.22.0-SNAPSHOT:
> >> [INFO]
> >> [INFO] Drill :  SUCCESS [
> >>   3.974 s]
> >> [INFO] Drill : Tools :  SUCCESS [
> >>   0.226 s]
> >> [INFO] Drill : Tools : Freemarker codegen . SUCCESS [
> >>   3.762 s]
> >> [INFO] Drill : Protocol ... SUCCESS [
> >>   5.001 s]
> >> [INFO] Drill : Common . SUCCESS [
> >>   4.944 s]
> >> [INFO] Drill : Logical Plan ... SUCCESS [
> >>   5.991 s]
> >> [INFO] Drill : Exec : . SUCCESS [
> >>   0.210 s]
> >> [INFO] Drill : Exec : Memory :  SUCCESS [
> >>   0.179 s]
> >> [INFO] Drill : Exec : Memory : Base ... SUCCESS [
> >>   2.373 s]
> >> [INFO] Drill : Exec : RPC . SUCCESS [
> >>   2.436 s]
> >> [INFO] Drill : Exec : Vectors . SUCCESS [
> >> 54.917 s]
> >> [INFO] Drill : Contrib : .. SUCCESS [
> >>   0.138 s]
> >> [INFO] Drill : Contrib : Data : ... SUCCESS [
> >>   0.143 s]
> >> [INFO] Drill : Contrib : Data : TPCH Sample ... SUCCESS [
> >>   1.473 s]
> >> [INFO] Drill : Metastore :  SUCCESS [
> >>   0.144 s]
> >> [INFO] Drill : Metastore : API  SUCCESS [
> >>   4.366 s]
> >> [INFO] Drill : Metastore : Iceberg  SUCCESS [
> >>   3.940 s]
> >> [INFO] Drill : Exec : Java Execution Engine ... SUCCESS
> [01:04
> >> min]
> >> [INFO] Drill : Exec : JDBC Driver using dependencies .. SUCCESS [
> >>   7.332 s]
> >> [INFO] Drill : Exec : JDBC JAR with all dependencies .. SUCCESS [
> >> 16.304 s]
> >> [INFO] Drill : On-YARN  SUCCESS [
> >>   5.477 s]
> >> [INFO] Drill : Metastore : RDBMS .. SUCCESS [
> >>   6.704 s]
> >> [INFO] Drill : Metastore : Mongo .. SUCCESS [
> >>   3.621 s]
> >> [INFO] Drill : Contrib : Storage : Kudu ... SUCCESS [
> >>   6.693 s]
> >> [INFO] Drill : Contrib : Format : XML . SUCCESS [
> >>   3.511 s]
> >> [INFO] Drill : Contrib : Storage : HTTP ... SUCCESS [
> >>   5.195 s]
> >> [INFO] Drill : Contrib : Storage : OpenTSDB ... SUCCESS [
> >>   3.561 s]
> >> [INFO] Drill : Contrib : Storage : MongoDB  SUCCESS [
> >>   4.850 s]
> >> [INFO] Drill : Contrib : Storage : HBase 

Re: [I] dir/*.parquet UNION ALL dir/*.json reading slow (drill)

2023-07-11 Thread via GitHub


pandalanax closed issue #2814: dir/*.parquet UNION ALL dir/*.json reading slow
URL: https://github.com/apache/drill/issues/2814


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] dir/*.parquet UNION ALL dir/*.json reading slow (drill)

2023-07-11 Thread via GitHub


pandalanax commented on issue #2814:
URL: https://github.com/apache/drill/issues/2814#issuecomment-1630227543

We created a workaround for the time being while we are upgrading the Drill
version in our cluster. Workaround: create `n` empty JSON files via

```bash
for n in {1..160}; do hdfs dfs -touch /path/to/json/files/tmp${n}.json; done
```

with `n` then matching the number of parquet files. Will close for now and
reopen if the issue persists with 1.21. Will also try out the Drill
Metastore, thanks for the suggestion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org