[jira] [Resolved] (DRILL-8375) Incomplete support for non-projected complex vectors

2024-01-07 Thread Paul Rogers (Jira)



Paul Rogers resolved DRILL-8375.

Resolution: Fixed

> Incomplete support for non-projected complex vectors
> 
>
> Key: DRILL-8375
> URL: https://issues.apache.org/jira/browse/DRILL-8375
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> The `ResultSetLoader` implementation supports all of Drill's vector types. 
> However, DRILL-8188 discovered holes in support for non-projected vectors.





[jira] [Created] (DRILL-8375) Incomplete support for non-projected complex vectors

2022-12-24 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8375:
--

 Summary: Incomplete support for non-projected complex vectors
 Key: DRILL-8375
 URL: https://issues.apache.org/jira/browse/DRILL-8375
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


The `ResultSetLoader` implementation supports all of Drill's vector types. 
However, DRILL-8188 discovered holes in support for non-projected vectors.





[jira] [Created] (DRILL-8185) EVF 2 doesn't handle map arrays or nested maps

2022-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8185:
--

 Summary: EVF 2 doesn't handle map arrays or nested maps
 Key: DRILL-8185
 URL: https://issues.apache.org/jira/browse/DRILL-8185
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.20.0
Reporter: Paul Rogers
Assignee: Paul Rogers


When converting Avro, Luoc found two bugs in how EVF 2 (the projection 
mechanism) handles map arrays and nested maps.





[jira] [Created] (DRILL-8159) Upgrade HTTPD, Text readers to use EVF3

2022-03-06 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8159:
--

 Summary: Upgrade HTTPD, Text readers to use EVF3
 Key: DRILL-8159
 URL: https://issues.apache.org/jira/browse/DRILL-8159
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Continuation of work originally in the DRILL-8085 PR.





[jira] [Created] (DRILL-8124) Fix implicit file issue with EVF 2

2022-02-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8124:
--

 Summary: Fix implicit file issue with EVF 2
 Key: DRILL-8124
 URL: https://issues.apache.org/jira/browse/DRILL-8124
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Unit testing with EVF 2 found an issue in the handling of implicit columns.





[jira] [Created] (DRILL-8123) Revise scan limit pushdown

2022-02-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8123:
--

 Summary: Revise scan limit pushdown
 Key: DRILL-8123
 URL: https://issues.apache.org/jira/browse/DRILL-8123
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Recent work added a pushdown of the limit into the scan. The work had a few 
holes, one of which was plugged by the recent update of EVF to manage the 
limit. Another hole is that the physical plan uses a value of 0 to indicate no 
limit, but 0 is a perfectly valid limit: it means "no data, only schema." Also, 
the field is named "maxRecords" but should be named "limit" to reflect its 
purpose.
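A minimal sketch of the distinction (the names here are hypothetical, not the 
actual plan serialization): keep 0 as a legal limit and use a separate sentinel 
for "no limit".

{code:java}
// Hedged sketch: a negative sentinel keeps 0 available to mean
// "schema only, no rows". Field renamed from "maxRecords" to "limit".
public class ScanLimit {
  public static final int NO_LIMIT = -1;  // hypothetical sentinel

  private final int limit;

  public ScanLimit(int limit) { this.limit = limit; }

  public boolean isLimited() { return limit != NO_LIMIT; }
  public boolean isSchemaOnly() { return limit == 0; }
}
{code}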





[jira] [Created] (DRILL-8115) LIMIT pushdown into EVF

2022-01-28 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8115:
--

 Summary: LIMIT pushdown into EVF
 Key: DRILL-8115
 URL: https://issues.apache.org/jira/browse/DRILL-8115
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Paul Rogers
Assignee: Paul Rogers


Add LIMIT support to the scan framework and EVF so that plugins don't have to 
implement it themselves.





[jira] [Created] (DRILL-8102) Tests use significant space outside the drill directory

2022-01-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8102:
--

 Summary: Tests use significant space outside the drill directory
 Key: DRILL-8102
 URL: https://issues.apache.org/jira/browse/DRILL-8102
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers


I use a Linux Mint (Ubuntu) machine in which the root file system has limited 
space, but /user has a large amount of space. My Drill build directory is 
within my home directory in /user. Most tests write to the various target 
folders within the Drill directory, which ensures that each test is isolated, 
and that test files are removed in a {{mvn clean}}.

However, it appears that some tests, perhaps Cassandra, ElasticSearch or Splunk, 
write to directories outside of Drill, perhaps to /tmp, /var, etc. The result 
is that, each time I run the tests, I get low disk-space warnings on my root 
file system. In the worst case, the tests fail due to lack of disk space.

Since it is not clear where the files are written, it is not clear what I 
should clean up, or how I might add a sym link to a location with more space. 
(Yes, I could get a bigger SSD, and rebuild my root file system, but I'm 
lazy...)

As a general rule, all Drill tests should write to a target directory. If that 
is not possible, then clearly state somewhere what directories are used so that 
sufficient space can be provided, and we know where to clean up files once 
the build runs.

Perhaps some of the tests start Docker containers? If so, then, again, it 
should be made clear how much cache space Docker will require.

Another suggestion is to change the build order. Those tests which require 
external resources should occur last, after all the others (UDFs, Syslog, etc.) 
which require only Drill. That way, if failures occur in the external systems, 
we at least know the core Drill modules work.





[jira] [Created] (DRILL-8101) Resolve the TIMESTAMP madness

2022-01-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8101:
--

 Summary: Resolve the TIMESTAMP madness
 Key: DRILL-8101
 URL: https://issues.apache.org/jira/browse/DRILL-8101
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers


Drill's TIMESTAMP type tries to be two different things at the same time, 
causing incorrect results when the two interpretations collide.

Drill has the classic DATE and TIME data types. A DATE is just that: a day 
wherever you happen to be. Your birthday goes from midnight to midnight in the 
time zone where you find yourself. If you happen to travel around the world, 
you can make your birthday last almost 48 hours as midnight of your birthday 
starts at the international date line, circles the globe, followed by the 
midnight of the next day.

Similarly, a time is a time where you are. 12:00PM is noon (more-or-less) as 
determined by the sun. 12:00PM occurs once in every time zone every day. Since 
there are many time zones, there are many noons each day.

These are both examples of local time. Most databases combine these two ideas 
to get a DATETIME: a date and time wherever you are.

In our modern world, knowing something occurred on 2022-01-02 12:00:00 is not 
good enough. Did it occur at that time in my time zone or yours? If the event 
is a user login, or a network breach, then it occurred once, at a specific 
time, it did not occur many times: once in each time zone. Hence, machines 
often use UTC time to coordinate.

Unix-like systems also define the idea of a "timestamp", the number of seconds 
(or milliseconds or nanoseconds) since 1970-01-01 00:00:00. This is the time 
reported by Java's {{System.currentTimeMillis()}} method. It is the time most 
often found in machine-generated logs. It may appear as a number (ms since the 
epoch) or as an ISO-formatted string.
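The ambiguity is easy to demonstrate with {{java.time}} (a minimal, 
Drill-independent sketch):

{code:java}
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class InstantVsWallClock {
  public static void main(String[] args) {
    long epochMs = System.currentTimeMillis();  // one unambiguous instant
    Instant instant = Instant.ofEpochMilli(epochMs);
    // The same instant renders as different wall-clock values per zone:
    LocalDateTime utcClock = LocalDateTime.ofInstant(instant, ZoneOffset.UTC);
    LocalDateTime localClock = LocalDateTime.ofInstant(instant, ZoneId.systemDefault());
    System.out.println(utcClock + " (UTC) vs " + localClock + " (local)");
  }
}
{code}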

Thus, users of Drill would expect to find a "timestamp" type that represents a 
UTC timestamp in Unix format. They will be disappointed, however.

Drill's TIMESTAMP type is essentially a DATETIME type: it is a date/time in an 
unspecified timezone and that zone can be whatever you want it to be. UTC? 
Fine. Local? OK. Nairobi? Sure, why not.

This works fine as long as _all_ your data is in the same time zone, and you 
don't need a concept of "now". As described in DRILL-8099 and DRILL-8100, this 
is how the authors of CTAS thought of it: read Parquet data straight into Drill 
with no conversion, then write it back out to JSON with no conversion. Both 
work with UTC, so the result is fine: who cares that the 32-bit number, when in 
Drill, had no implied time zone? It is just a number we read then write. All 
good.

It is even possible to compute the difference of two DATETIMEs with unspecified 
time zone: that's what an INTERVAL does. As long as the times are actually in 
the same zone (UTC, say, or local, or Nairobi), then all is fine.

Everything collapses, however, when someone wants to know, "but how long ago 
was that event"? "Long enough ago that I need to raise the escalation level?" 
Drill has the INTERVAL type to give us the difference, but how do I get "now"? 
Drill has {{CURRENT_TIMESTAMP}}. But now we have a problem: what timezone is 
that time in? UTC? My local timezone? Nairobi? And, what if my data is UTC but 
{{CURRENT_TIMESTAMP}} is local? Or vice-versa? The whole house of cards comes 
crashing down.

Over the years, this bug has appeared again and again. Sometimes people change 
the logic to assume TIMESTAMP is UTC. Sometimes things are changed to assume 
TIMESTAMP is local time (I've been guilty of this). Sometimes we just punt, and 
require that the machine (or test) run only in UTC, since that's the only place 
the two systems coincide.

But, in fact, I believe that the original designers of Drill meant TIMESTAMP to 
have _no_ timezone: two TIMESTAMP values could be in entirely different 
(unknown) timezones! One can see vestiges of this in the value vector code. It 
seems the original engineers imagined a "TIMESTAMP_WITH_ZONE" type, similar to 
Java's (or Joda's) {{ZonedDateTime}} type. Other bits of code (Parquet) refer 
to a never-built "TIMESTAMPZ" type for a UTC timestamp. When faced with the 
{{CURRENT_TIMESTAMP}} issue, fixes started down the path of saying that 
TIMESTAMP is local time, but this is probably a misunderstanding of the 
original design, forced upon us by the gaps in that original design.

Further, each time we make a change (such as DRILL-8099 and DRILL-8100), we 
change behavior, potentially breaking a kludge that someone found to 
kinda-sorta make things work.

Since computers can't deal with ambiguity the way humans can, we need a 
solution. It is not good enough for you to think "TIMESTAMP is UTC" and me to 
think "TIMESTAMP is local" and for Bob to think "TIMESTAMP is Java's 
{{LocalDateTime}}; it has no zone." The software needs to work one way.

[jira] [Created] (DRILL-8100) JSON record writer does not convert Drill local timestamp to UTC

2022-01-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8100:
--

 Summary: JSON record writer does not convert Drill local timestamp to UTC
 Key: DRILL-8100
 URL: https://issues.apache.org/jira/browse/DRILL-8100
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill follows the old SQL engine convention to store the `TIMESTAMP` type in 
the local time zone. This is, of course, highly awkward in today's age when UTC 
is used as the standard timestamp in most products. However, it is how Drill 
works. (It would be great to add a `UTC_TIMESTAMP` type, but that is another 
topic.)

Each reader or writer that works with files that hold UTC timestamps must 
convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise, 
Drill works correctly only when the server time zone is set to UTC.

The JSON writer does not do the proper conversion, causing tests to fail when 
run in a time zone other than UTC.

{noformat}
  @Override
  public void writeTimestamp(FieldReader reader) throws IOException {
    if (reader.isSet()) {
      // Bug: the LocalDateTime is formatted as if it were UTC; no
      // local-to-UTC conversion is performed.
      writeTimestamp(reader.readLocalDateTime());
    } else {
      writeTimeNull();
    }
  }
{noformat}

Basically, it takes a {{LocalDateTime}} and formats it as a UTC timestamp 
(using the "Z" suffix). This is only valid if the machine is in the UTC time 
zone, which is why the test for this class attempts to force the local time 
zone to UTC, something that most users will not do.
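A minimal sketch of the missing conversion (assuming the wall-clock value is in 
the server's default zone; the class and helper names are hypothetical):

{code:java}
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampConversion {
  // Reinterpret the wall-clock LocalDateTime in the server's zone, then
  // shift it to UTC so the "Z" suffix becomes truthful.
  static LocalDateTime toUtc(LocalDateTime drillLocal) {
    return drillLocal.atZone(ZoneId.systemDefault())
        .withZoneSameInstant(ZoneOffset.UTC)
        .toLocalDateTime();
  }
}
{code}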

A consequence of this bug is that "round trip" CTAS will change dates by the 
UTC offset of the machine running the CTAS. In the Pacific time zone, each 
"round trip" subtracts 8 hours from the time. After three round trips, the 
"UTC" date in the Parquet file or JSON will be a day earlier than the original 
data. One might argue that this "feature" is not always helpful.





[jira] [Created] (DRILL-8099) Parquet record writer does not convert Drill local timestamp to UTC

2021-12-31 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8099:
--

 Summary: Parquet record writer does not convert Drill local timestamp to UTC
 Key: DRILL-8099
 URL: https://issues.apache.org/jira/browse/DRILL-8099
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill follows the old SQL engine convention to store the `TIMESTAMP` type in 
the local time zone. This is, of course, highly awkward in today's age when UTC 
is used as the standard timestamp in most products. However, it is how Drill 
works. (It would be great to add a `UTC_TIMESTAMP` type, but that is another 
topic.)

Each reader or writer that works with files that hold UTC timestamps must 
convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise, 
Drill works correctly only when the server time zone is set to UTC.

Now, perhaps we can convince most shops to run their Drill server in UTC, or at 
least set the JVM timezone to UTC. However, this still leaves developers in the 
lurch: if the development machine's timezone is not UTC, then some tests fail. 
In particular:

{{TestNestedDateTimeTimestamp.testNestedDateTimeCTASParquet}}

The reason that the above test fails is that the generated Parquet writer code 
assumes (incorrectly) that the Drill timestamp is in UTC and so no conversion 
is needed to write that data into Parquet. In particular, in 
{{ParquetOutputRecordWriter.getNewTimeStampConverter()}}:

{noformat}
reader.read(holder);
// The raw local-zone millisecond value is written with no UTC conversion.
consumer.addLong(holder.value);
{noformat}
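A sketch of the needed conversion (assuming {{holder.value}} encodes the local 
wall clock as if it were UTC millis; the class and helper names are 
illustrative, not the generated code):

{code:java}
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class ParquetTimestampFix {
  // Decode the stored millis as a wall clock, reinterpret it in the
  // server's zone, and re-encode as true UTC millis for Parquet.
  static long localMillisToUtc(long drillMillis) {
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(drillMillis), ZoneOffset.UTC);
    return wallClock.atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
  }
}
{code}

With such a helper, the generated code would call 
{{consumer.addLong(localMillisToUtc(holder.value))}} instead of writing the raw 
value.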





[jira] [Created] (DRILL-8087) {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASExtendedJson}} assumes time zone

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8087:
--

 Summary: 
{{TestNestedDateTimeTimestamp.testNestedDateTimeCTASExtendedJson}} assumes time 
zone
 Key: DRILL-8087
 URL: https://issues.apache.org/jira/browse/DRILL-8087
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
 Environment: 

Reporter: Paul Rogers


Drill's date types follow older SQL engines: dates and times are assumed to be 
in the local time zone. However, most modern applications use UTC timestamps 
to avoid the issues that crop up when using local times in systems that span 
time zones.

The {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASExtendedJson}} unit 
test seems to assume that it runs in a particular time zone. When run on 
a machine in the Pacific time zone, the test fails:

{noformat}
java.lang.Exception: at position 0 column '`time_map`' mismatched values, 
expected: {"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
17:40:52.123"}(JsonStringHashMap) but received 
{"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
10:40:52.123"}(JsonStringHashMap)

Expected Records near verification failure:
Record Number: 0 { `date_list` : ["1970-01-11"],`date` : 1970-01-11,`time_list` 
: ["00:00:03.600"],`time_map` : 
{"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
17:40:52.123"},`time` : 00:00:03.600,`timestamp_list` : ["2018-03-23 
17:40:52.123"],`timestamp` : 2018-03-23T17:40:52.123, }

Actual Records near verification failure:
Record Number: 0 { `date_list` : ["1970-01-11"],`date` : 1970-01-11,`time_list` 
: ["00:00:03.600"],`time_map` : 
{"date":"1970-01-11","time":"00:00:03.600","timestamp":"2018-03-23 
10:40:52.123"},`time` : 00:00:03.600,`timestamp_list` : ["2018-03-23 
10:40:52.123"],`timestamp` : 2018-03-23T10:40:52.123, }

For query: select * from `ctas_nested_datetime_extended_json` t1 
{noformat}

Notice the time difference: *17*:40:52.123 (expected) vs. *10*:40:52.123 
(actual).

Since this test causes the build to fail in my time zone, it will be disabled 
in my PR. Enable it again when the timezone issue is fixed.





[jira] [Created] (DRILL-8086) Convert the CSV (AKA "compliant text") reader to EVF V2

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8086:
--

 Summary: Convert the CSV (AKA "compliant text") reader to EVF V2
 Key: DRILL-8086
 URL: https://issues.apache.org/jira/browse/DRILL-8086
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Work was done some time ago to convert the CSV reader to use EVF V3. Merge that 
work into the master branch.





[jira] [Created] (DRILL-8085) EVF V2 support in the "Easy" format plugin

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8085:
--

 Summary: EVF V2 support in the "Easy" format plugin
 Key: DRILL-8085
 URL: https://issues.apache.org/jira/browse/DRILL-8085
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Add support for EVF V2 to the {{EasyFormatPlugin}} similar to how EVF V1 
support already exists. Provide examples for others to follow.





[jira] [Created] (DRILL-8084) Scan LIMIT pushdown fails across files

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8084:
--

 Summary: Scan LIMIT pushdown fails across files
 Key: DRILL-8084
 URL: https://issues.apache.org/jira/browse/DRILL-8084
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers
Assignee: Paul Rogers


DRILL-7763 apparently added limit pushdowns to the file format plugins, which 
is a nice improvement. Unfortunately, the implementation only works for a scan 
with a single file: the limit is applied to each file independently. The 
correct implementation is to apply the limit to the {_}scan{_}, not the 
{_}file{_}.

Further, {{LIMIT 0}} has meaning: it asks to return a schema with no data. 
However, the implementation uses {{maxRecords == 0}} to mean no limit, and a 
bit of code explicitly changes {{LIMIT 0}} to {{LIMIT 1}} so that "we read at 
least one file".

Consider an example. Two files, A and B, each of which has 10 records:
 * {{LIMIT 0}}: Obtain the schema from A, read no data from A. Do not open 
B. The current code changes {{LIMIT 0}} to {{LIMIT 1}}, thus returning data.
 * {{LIMIT 1}}: Read one record from A, none from B. (Don't even open B.) 
The current code will read 1 record from A and another from B.
 * {{LIMIT 15}}: Read all 10 records from A, and only 5 from B. The current 
code applies the limit of 15 to both files, thus reading 20 records.

The correct solution is to manage the {{LIMIT}} at the scan level. As each file 
completes, subtract the returned row count from the limit applied to the next 
file.
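A minimal sketch of that bookkeeping (the {{FileReader}} type and 
{{readUpTo()}} method are hypothetical, not the actual EVF API):

{code:java}
// Scan-level limit, decremented as each file completes.
int remaining = limit;
for (FileReader reader : files) {
  // LIMIT 0 still opens the first file so the scan can return its schema.
  int returned = reader.readUpTo(remaining);  // per-file cap; may return 0 rows
  remaining -= returned;
  if (remaining <= 0) {
    break;  // don't even open the next file
  }
}
{code}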

And, at the file level, there is no need for each file to count its records 
and check the limit on each row read. The "result set loader" already checks 
batch limits: it is the natural place to check the overall limit.

For this reason, the V2 EVF scan framework has been extended to manage the 
scan-level part, and the "result set loader" has been extended to enforce the 
per-file limit. The result is that readers need do...absolutely nothing; 
{{LIMIT}} pushdown is automatic.

EVF V1 has also been extended, but is less thoroughly tested since the desired 
path is to upgrade all readers to use EVF V2.





[jira] [Created] (DRILL-8083) HttpdLogBatchReader creates unnecessary empty maps

2021-12-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-8083:
--

 Summary: HttpdLogBatchReader creates unnecessary empty maps
 Key: DRILL-8083
 URL: https://issues.apache.org/jira/browse/DRILL-8083
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.19.0
Reporter: Paul Rogers


Run the {{TestHTTPDLogReader.testStarRowSet}} test. Set a breakpoint in
{{MapWriter.SingleMapWriter.endWrite}}. Step into the {{super.endWrite()}}
method which will walk the set of child fields. Notice that there are none
for any of the several map fields.

One can see that empty maps are expected in the
{{TestHTTPDLogReader.expectedAllFieldsSchema()}} method.

Maps (i.e. tuples) are not well defined in SQL. Although Drill makes great
efforts to support them, an empty tuple is not well defined even in Drill:
there is nothing one can do with such fields.

Suggestion: don't create a map field if there are to be no members of the
map.

Affected maps:

* {{request_firstline_original_uri_query_$}}
* {{request_firstline_uri_query_$}}
* {{request_referer_last_query_$}}
* {{request_referer_query_$}}





[jira] [Resolved] (DRILL-7325) Many operators do not set container record count

2021-04-25 Thread Paul Rogers (Jira)



Paul Rogers resolved DRILL-7325.

Resolution: Fixed

A number of individual commits fixed problems found in each operator. This 
overall task is now complete.

> Many operators do not set container record count
> 
>
> Key: DRILL-7325
> URL: https://issues.apache.org/jira/browse/DRILL-7325
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.19.0
>
>
> See DRILL-7324. The following are problems found because some operators fail 
> to set the record count for their containers.
> h4. Scan
> TestComplexTypeReader, on cluster setup, using the PojoRecordReader:
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from ScanBatch
> ScanBatch: Container record count not set
> Reason: ScanBatch never sets the record count of its container (this is a 
> generic issue, not specific to the PojoRecordReader).
> h4. Filter
> {{TestComplexTypeReader.testNonExistentFieldConverting()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from FilterRecordBatch
> FilterRecordBatch: Container record count not set
> {noformat}
> h4. Hash Join
> {{TestComplexTypeReader.test_array()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from HashJoinBatch
> HashJoinBatch: Container record count not set
> {noformat}
> Occurs on the first batch in which the hash join returns {{OK_NEW_SCHEMA}} 
> with no records.
> h4. Project
> {{TestCsvWithHeaders.testEmptyFile()}} (when the text reader returned empty, 
> schema-only batches):
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from ProjectRecordBatch
> ProjectRecordBatch: Container record count not set
> {noformat}
> Occurs in {{ProjectRecordBatch.handleNullInput()}}: it sets up the schema but 
> does not set the value count to 0.
> h4. Unordered Receiver
> {{TestCsvWithSchema.testMultiFileSchema()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from UnorderedReceiverBatch
> UnorderedReceiverBatch: Container record count not set
> {noformat}
> The problem is that {{RecordBatchLoader.load()}} does not set the container 
> record count.
> h4. Streaming Aggregate
> {{TestJsonReader.testSumWithTypeCase()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from StreamingAggBatch
> StreamingAggBatch: Container record count not set
> {noformat}
> The problem is that {{StreamingAggBatch.buildSchema()}} does not set the 
> container record count to 0.
> h4. Limit
> {{TestJsonReader.testDrill_1419()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from LimitRecordBatch
> LimitRecordBatch: Container record count not set
> {noformat}
> None of the paths in {{LimitRecordBatch.innerNext()}} set the container 
> record count.
> h4. Union All
> {{TestJsonReader.testKvgenWithUnionAll()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from UnionAllRecordBatch
> UnionAllRecordBatch: Container record count not set
> {noformat}
> When {{UnionAllRecordBatch}} calls 
> {{VectorAccessibleUtilities.setValueCount()}}, it does not also set the 
> container count.
> h4. Hash Aggregate
> {{TestJsonReader.drill_4479()}}:
> {noformat}
> ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
> from HashAggBatch
> HashAggBatch: Container record count not set
> {noformat}
> Problem is that {{HashAggBatch.buildSchema()}} does not set the container 
> record count to 0 for the first, empty, batch sent for {{OK_NEW_SCHEMA}}.
> h4. And Many More
> It turns out that most operators fail to set one of the many row count 
> variables somewhere in their code path: maybe in the schema setup path, maybe 
> when building a batch along one of the many paths that operators follow. 
> Further, we have multiple row counts that must be set:
> * Values in each vector ({{setValueCount()}}),
> * Row count in the container ({{setRecordCount()}}), which must be the same 
> as the vector value count.
> * Row count in the operator (batch), which is the (possibly filtered) count 
> of records presented to downstream operators. It must be less than or equal 
> to the container row count (except for an SV4).
> * The SV2 record count, which is the number of entries in the SV2 and must be 
> the same as the batch row count (and less than or equal to the container row 
> count).
> * The SV2 actual batch record count, which must be the same as the container 
> row count.
> * The SV4 record 

[jira] [Resolved] (DRILL-6953) Merge row set-based JSON reader

2021-04-25 Thread Paul Rogers (Jira)



Paul Rogers resolved DRILL-6953.

Resolution: Fixed

Resolved via a series of individual tickets.

> Merge row set-based JSON reader
> ---
>
> Key: DRILL-6953
> URL: https://issues.apache.org/jira/browse/DRILL-6953
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.19.0
>
>
> The final step in the ongoing "result set loader" saga is to merge the 
> revised JSON reader into master. This reader does three key things:
> * Demonstrates the prototypical "late schema" style of data reading (discover 
> schema while reading).
> * Implements many tricks and hacks to handle schema changes while loading.
> * Shows that, even with all these tricks, the only true solution is to 
> actually have a schema.
> The new JSON reader:
> * Uses an expanded state machine when parsing rather than the complex set of 
> if-statements in the current version.
> * Handles reading a run of nulls before seeing the first data value (as long 
> as the data value shows up in the first record batch).
> * Uses the result-set loader to generate fixed-size batches regardless of the 
> complexity, depth of structure, or width of variable-length fields.
> While the JSON reader itself is helpful, the key contribution is that it 
> shows how to use the entire kit of parts: result set loader, projection 
> framework, and so on. Since the projection framework can handle an external 
> schema, it is also a handy foundation for the ongoing schema project.
> Key work to complete after this merger will be to reconcile actual data with 
> the external schema. For example, if we know a column is supposed to be a 
> VarChar, then read the column as a VarChar regardless of the type JSON itself 
> picks. Or, if a column is supposed to be a Double, then convert Int and 
> String JSON values into Doubles.
> The Row Set framework was designed to allow inserting custom column writers. 
> This would be a great opportunity to do the work needed to create them. Then, 
> use the new JSON framework to allow parsing a JSON field as a specified Drill 
> type.





[jira] [Created] (DRILL-7789) Exchanges are slow on large systems & queries

2020-09-23 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7789:
--

 Summary: Exchanges are slow on large systems & queries
 Key: DRILL-7789
 URL: https://issues.apache.org/jira/browse/DRILL-7789
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers


A user with a moderate-sized cluster and query has experienced extreme slowness 
in exchanges: up to 11/12 of the time is spent waiting in one query, and 3/4 of 
the time in another. We suspect that exchanges are somehow serializing across 
the cluster.

Cluster:
 * Drill 1.16 (MapR version)
 * MapR-FS
 * Data stored in an 8 GB Parquet file, which unpacks to about 80 GB, 20B records
 * 4 Drillbits
 * Each node has 56 cores, 400 GB of memory
 * Drill queries run with 40 fragments (70% of CPU) and 80 GB of memory

The query is, essentially:

{noformat}
Parquet writer
- Hash Join
  - Scan
  - Window, Sort
  - Window, Sort
  - Hash Join
- Scan
- Scan
{noformat}

In the above, each line represents a fragment boundary. The plan includes mux 
exchanges between the two "lower" scans and the hash join.

The total query time is 6 hours. Of that, 30 minutes is spent working; the 
other 5.5 hours is spent waiting. (The 30 minutes is obtained by summing the 
"Avg Runtime" column in the profile.)

When checking resource usage with "top", we found that only a small amount of 
CPU was used. We should have seen 4000% (40 cores) but we actually saw just 
around 300-400%. This again indicates that the query spent most of its time 
doing nothing: not using CPU.

In particular the sender spends about 5 hours waiting for the receiver, which 
in turn spends about 5 hours waiting for the sender. This pattern occurs in 
every exchange in the "main" data path (the 20B records.)

As an experiment, the user disabled Mux exchanges. The system became overloaded 
at 40 fragments per node, so parallelism was reduced to 20. Now, the partition 
sender waited for the unordered receiver and vice-versa.

The original query incurred spilling. We hypothesized that the spilling caused 
delays which somehow rippled through the DAG. However, the user revised the 
query to eliminate spilling and to reduce the query to just the "bottom" hash 
join. The query ran for an hour, of which 3/4 of the time was again spent with 
senders and receivers waiting for each other.

We have eliminated a number of potential causes:

* System has sufficient memory
* MapRFS file system has plenty of spindles and plenty of I/O capability.
* Network is fast
* No other load on the nodes
* Query was simplified down to the simplest possible: a single join (with 
exchanges)
* If the query is simplified further (scan and write to Parquet, no join), it 
completes in just a few minutes: about as fast as the disk I/O rate.

The query profile does not provide sufficient information to dig further. The 
profile provides aggregate wait times, but does not, say, tell us which 
fragments wait for which other fragments for how long.

We believe that, if the exchange delays are fixed, the query which takes six 
hours should complete in less than a half hour -- even with shuffles, spilling, 
reading from Parquet and writing to Parquet.





[jira] [Created] (DRILL-7734) Revise the result set reader

2020-05-05 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7734:
--

 Summary: Revise the result set reader
 Key: DRILL-7734
 URL: https://issues.apache.org/jira/browse/DRILL-7734
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Updates to the {{ResultSetReader}} abstractions to make them usable in more 
cases.





[jira] [Created] (DRILL-7733) Use streaming for REST JSON queries

2020-05-05 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7733:
--

 Summary: Use streaming for REST JSON queries
 Key: DRILL-7733
 URL: https://issues.apache.org/jira/browse/DRILL-7733
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Several users on the user and dev mail lists have complained about the memory 
overhead when running a REST JSON query: {{http://node:8047/query.json}}. The 
current implementation buffers the entire result set in memory, then lets 
Jersey/Jetty convert the results to JSON. The result is very heavy heap use for 
larger query result sets.

This ticket requests a change to use streaming. As each batch arrives at the 
Screen operator, convert that batch to JSON and directly stream the results to 
the client network connection, much as is done for the native client connection.

For backward compatibility, the form of the JSON must be the same as the 
current API.
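A rough sketch of the streaming shape (using Jackson's streaming generator; the 
{{Batch}} stand-in and {{writeBatchRows()}} helper are hypothetical, not 
Drill's actual classes):

{code:java}
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import java.io.IOException;
import java.io.OutputStream;

public class StreamingJsonResults {
  interface Batch { }  // stand-in for a Drill record batch

  static void stream(OutputStream out, Iterable<Batch> batches) throws IOException {
    JsonGenerator gen = new JsonFactory().createGenerator(out);
    gen.writeStartObject();
    gen.writeArrayFieldStart("rows");
    for (Batch batch : batches) {
      writeBatchRows(gen, batch);
      gen.flush();  // push each batch to the client as it arrives
    }
    gen.writeEndArray();
    gen.writeEndObject();
    gen.close();
  }

  static void writeBatchRows(JsonGenerator gen, Batch batch) throws IOException {
    // Hypothetical: walk the batch's vectors and emit one JSON object per row.
  }
}
{code}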





[jira] [Created] (DRILL-7729) Use java.time in column accessors

2020-05-04 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7729:
--

 Summary: Use java.time in column accessors
 Key: DRILL-7729
 URL: https://issues.apache.org/jira/browse/DRILL-7729
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Use {{java.time}} classes in the column accessors, except for {{Interval}}, 
which has no {{java.time}} equivalent. Doing so allows us to create a row-set 
version of Drill's JSON writer.
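A sketch of the mapping (the vector encodings here, epoch days for DATE and 
epoch millis for TIMESTAMP, are assumptions for illustration):

{code:java}
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class AccessorTimeSketch {
  static LocalDate date(long epochDays) {
    return LocalDate.ofEpochDay(epochDays);
  }

  static LocalDateTime timestamp(long epochMillis) {
    // Drill's TIMESTAMP carries no zone, so read it as a plain wall clock.
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
  }
}
{code}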





[jira] [Created] (DRILL-7728) Drill SPI framework

2020-05-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7728:
--

 Summary: Drill SPI framework
 Key: DRILL-7728
 URL: https://issues.apache.org/jira/browse/DRILL-7728
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Provide the basic framework to load an extension in Drill, modelled after the 
Java Service Provider concept. Excludes full class loader isolation for now.
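A minimal sketch of the pattern being referenced (the {{DrillExtension}} 
interface and its registration hook are hypothetical names):

{code:java}
import java.util.ServiceLoader;

public class ExtensionLoader {
  public interface DrillExtension {
    void register(Object drillbitContext);
  }

  // Implementations are discovered via entries under META-INF/services
  // on the classpath, per the standard Java Service Provider mechanism.
  public static void loadAll(Object drillbitContext) {
    for (DrillExtension ext : ServiceLoader.load(DrillExtension.class)) {
      ext.register(drillbitContext);
    }
  }
}
{code}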





[jira] [Created] (DRILL-7725) Updates to EVF2

2020-04-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7725:
--

 Summary: Updates to EVF2
 Key: DRILL-7725
 URL: https://issues.apache.org/jira/browse/DRILL-7725
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Enhancements to the "version 2" of the "Enhanced Vector Framework" to prepare 
for upgrading the text reader to EVF2.





[jira] [Created] (DRILL-7724) Refactor metadata controller batch

2020-04-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7724:
--

 Summary: Refactor metadata controller batch
 Key: DRILL-7724
 URL: https://issues.apache.org/jira/browse/DRILL-7724
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


A debugging session revealed opportunities to simplify 
{{MetadataControllerBatch}}.





[jira] [Created] (DRILL-7717) Support Mongo extended types in V2 JSON loader

2020-04-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7717:
--

 Summary: Support Mongo extended types in V2 JSON loader
 Key: DRILL-7717
 URL: https://issues.apache.org/jira/browse/DRILL-7717
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Drill supports Mongo's extended types in the V1 JSON reader. Add similar 
support to the V2 version.





[jira] [Created] (DRILL-7711) Add data path, parameter filter pushdown to HTTP plugin

2020-04-18 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7711:
--

 Summary: Add data path, parameter filter pushdown to HTTP plugin
 Key: DRILL-7711
 URL: https://issues.apache.org/jira/browse/DRILL-7711
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Add to the new HTTP plugin two new features:

 * The ability to express a path to the data to avoid having to work with 
complex message objects in SQL.
 * The ability to specify HTTP parameters using filter push-downs from SQL.





[jira] [Created] (DRILL-7709) CTAS as CSV creates files which the "csv" plugin can't read

2020-04-17 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7709:
--

 Summary: CTAS as CSV creates files which the "csv" plugin can't 
read
 Key: DRILL-7709
 URL: https://issues.apache.org/jira/browse/DRILL-7709
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


Change the output format to CSV and create a file:
{noformat}
ALTER SESSION SET `store.format` = 'csv';
CREATE TABLE foo AS ...
 {noformat}

You will end up with a directory "foo" that contains a CSV file: "0_0_0.csv". 
Now, try to query that file:

{noformat}
SELECT * FROM foo
{noformat}

The query will fail, or return incorrect results, because in Drill, the "csv" 
read format is CSV *without* headers. But, on write, "csv" is CSV *with* 
headers.

The (very messy) workaround is to manually rename all the files to use the 
".csvh" suffix, or to create a separate storage plugin config for that target 
with a new "csv" format plugin that does not have headers.

I expect that if I create a file in Drill, I should be able to immediately read 
that file back without extra hokey-pokey.
 





[jira] [Created] (DRILL-7708) Downgrade maven from 3.6.3 to 3.6.0

2020-04-17 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7708:
--

 Summary: Downgrade maven from 3.6.3 to 3.6.0
 Key: DRILL-7708
 URL: https://issues.apache.org/jira/browse/DRILL-7708
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


DRILL-7704 upgraded Drill's Maven version to 3.6.3.


As it turns out, I use Ubuntu (Linux Mint) for development. Maven is installed 
as a package using apt-get. Packages can lag behind a bit. The latest maven 
available via apt-get is 3.6.0.


It is a nuisance to install a new version outside the package manager. I 
changed the Maven version in the root pom.xml to 3.6.0 and the build seemed to 
work. Any reason we need the absolute latest version rather than just 3.6.0 or 
later?


The workaround for now is to manually edit the pom.xml file on each checkout, 
then revert the change before commit. This ticket requests to adjust the 
"official" version to 3.6.0.





[jira] [Resolved] (DRILL-7655) Add Default Schema text box to Edit Query page in query profile

2020-04-15 Thread Paul Rogers (Jira)



Paul Rogers resolved DRILL-7655.

Resolution: Fixed

Fixed as part of PR #2052.

> Add Default Schema text box to Edit Query page in query profile
> ---
>
> Key: DRILL-7655
> URL: https://issues.apache.org/jira/browse/DRILL-7655
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.18.0
>Reporter: Vova Vysotskyi
>Assignee: Paul Rogers
>Priority: Major
> Fix For: Future
>
> Attachments: image-2020-03-21-01-44-15-062.png, 
> image-2020-03-21-01-44-57-172.png, image-2020-03-21-01-45-24-782.png
>
>
> In DRILL-7603 was added functionality to specify default schema for query in 
> Drill Web UI when submitting the query.
> Also, the query may be resubmitted from the profiles page, and for the case 
> when the query was submitted with specified default schema, its resubmission 
> will fail.
> The aim of this Jira is to add Default Schema text box to this page and 
> populate it with schema specified for the specific query if possible.
> !image-2020-03-21-01-44-15-062.png!
>  
> !image-2020-03-21-01-44-57-172.png!
>  
> !image-2020-03-21-01-45-24-782.png!





[jira] [Created] (DRILL-7703) Support for 3+D arrays in EVF JSON loader

2020-04-15 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7703:
--

 Summary: Support for 3+D arrays in EVF JSON loader
 Key: DRILL-7703
 URL: https://issues.apache.org/jira/browse/DRILL-7703
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Add support for multiple levels of repeated list to the new EVF-based JSON 
reader.

As work continues on adding the new JSON reader to Drill, running unit tests 
reveals that some tests include lists with three (perhaps more) dimensions.





[jira] [Created] (DRILL-7701) EVF V2 Scan Framework

2020-04-14 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7701:
--

 Summary: EVF V2 Scan Framework
 Key: DRILL-7701
 URL: https://issues.apache.org/jira/browse/DRILL-7701
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Scan framework for the "V2" EVF schema resolution committed in DRILL-7696.





[jira] [Resolved] (DRILL-7685) Case statement marking column as required in parquet metadata

2020-04-12 Thread Paul Rogers (Jira)



Paul Rogers resolved DRILL-7685.

Resolution: Cannot Reproduce

Tested in Drill 1.18 (snapshot) and found that the provided query works fine. 
Suggested the user try the newer Drill version.

If you still have a problem please reopen this bug and provide another example 
so we can locate and fix the issue, if it still exists in the latest code.

> Case statement marking column as required in parquet metadata
> -
>
> Key: DRILL-7685
> URL: https://issues.apache.org/jira/browse/DRILL-7685
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.16.0
>Reporter: Nitin Pawar
>Assignee: Paul Rogers
>Priority: Minor
>
> We use apache drill for multi step processing.
> In one of the steps we have query as below
> ~create table dfs.tmp.`/t2` as select employee_id, case when department_id is 
> not null then 1 else 2 end as case_output from cp.`employee.json`;~
> This provides output as 
> employee_id: OPTIONAL INT64 R:0 D:1
> case_output: REQUIRED INT32 R:0 D:0
> If we remove the end statement from case it does mark the column as optional.
>  
> We feed this output to covariance function and because of this we get an 
> error like below 
> Error: Missing function implementation: [covariance(BIGINT-OPTIONAL, 
> INT-REQUIRED)]. Full expression: --UNKNOWN EXPRESSION--
>  





[jira] [Created] (DRILL-7697) Revise query editor in profile page of web UI

2020-04-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7697:
--

 Summary: Revise query editor in profile page of web UI
 Key: DRILL-7697
 URL: https://issues.apache.org/jira/browse/DRILL-7697
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Drill has two separate query editors:

* The one displayed from the Query tab
* The one displayed from the Edit Query tab within Profiles

The two editors do basically the same thing, but have evolved as copies that 
have diverged.

* The Query tab editor places the three query types above the query text box, 
while the Profiles version puts the same control below the query text box.
* Similarly the Query tab editor puts the Ctrl+Enter hint above the text box, 
Profiles puts it below.

A first request is to unify the two editors. In particular, move the code to a 
common template file included in both places.

Second, the Profiles editor is a bit redundant.

* Displays a "Cancel Query" button even if the query is completed. Hide this 
button for completed queries. (Since there is a race condition, hide it for 
queries completed at the time the page was created.)
* No need to ask the user for the query type. The profile should include the 
type and the type should be a fixed field in Profiles.
* Similarly, the limit and (in Drill 1.18) the Default Schema should also be 
recorded in the query plan and fixed.

Finally, since system/session options can affect a query, and are part of the 
query plan, show those in the query as well so it can be rerun in the same 
environment in which it originally ran.






[jira] [Resolved] (DRILL-6672) Drill table functions cannot handle "setFoo" accessors

2020-04-11 Thread Paul Rogers (Jira)



Paul Rogers resolved DRILL-6672.

Resolution: Not A Problem

Storage and format plugins must be immutable since their entire values are used 
as keys in an internal map (plugin registry and format plugin tables). So, no 
config should have a "setFoo()" method.

> Drill table functions cannot handle "setFoo" accessors
> --
>
> Key: DRILL-6672
> URL: https://issues.apache.org/jira/browse/DRILL-6672
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Consider an example format plugin, such as the regex one used in the Drill 
> book. (GitHub reference needed.) We can define the plugin using getters and 
> setters like this:
> {code}
> public class RegexFormatConfig implements FormatPluginConfig {
>   private String regex;
>   private String fields;
>   private String extension;
>   public void setRegex(String regex) { this.regex = regex; }
>   public void setFields(String fields) { this.fields = fields; }
>   public void setExtension(String extension) { this.extension = extension; }
> }
> {code}
> We can then create a plugin configuration using the Drill Web console, the 
> {{bootstrap-storage-plugins.json}} and so on. All work fine.
> Suppose we try to define a configuration using a Drill table function:
> {code}
>   final String sql = "SELECT * FROM table(cp.`regex/simple.log2`\n" +
>   "(type => 'regex',\n" +
>   " extension => 'log2',\n" +
>   " regex => '(dddd)-(dd)-(dd) 
> .*',\n" +
>   " fields => 'a, b, c, d'))";
> {code}
> We get this error:
> {noformat}
> org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: 
> can not set value (\d\d\d\d)-(\d\d)-(\d\d) .* to parameter regex: class 
> java.lang.String
> table regex/simple.log2
> parameter regex
> {noformat}
> The reason is that the code that handles table functions only knows how to 
> set public fields; it does not know about the Java Bean getter/setter 
> conventions used by Jackson:
> {code}
> package org.apache.drill.exec.store.dfs;
> ...
> final class FormatPluginOptionsDescriptor {
>   ...
>   FormatPluginConfig createConfigForTable(TableInstance t) {
> ...
> Field field = pluginConfigClass.getField(paramDef.name);
> ...
> }
> field.set(config, param);
>   } catch (IllegalAccessException | NoSuchFieldException | 
> SecurityException e) {
> throw UserException.parseError(e)
> .message("can not set value %s to parameter %s: %s", param, 
> paramDef.name, paramDef.type)
> ...
> {code}
> The only workaround is to make all fields public:
> {code}
> public class RegexFormatConfig implements FormatPluginConfig {
>   public String regex;
>   public String fields;
>   public String extension;
> }
> {code}
> Since public fields are not good practice, please modify the table function 
> mechanism to follow Jackson conventions and allow Java Bean style setters. 
> (Or better, fix DRILL-6673 to allow immutable format objects via the use of a 
> constructor.)





[jira] [Created] (DRILL-7696) EVF v2 Scan Schema Resolution

2020-04-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7696:
--

 Summary: EVF v2 Scan Schema Resolution
 Key: DRILL-7696
 URL: https://issues.apache.org/jira/browse/DRILL-7696
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Revises the mechanism EVF uses to resolve the schema for a scan. See PR for 
details.





[jira] [Created] (DRILL-7690) Display (major) operators in fragment title bar in Web UI

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7690:
--

 Summary: Display (major) operators in fragment title bar in Web UI
 Key: DRILL-7690
 URL: https://issues.apache.org/jira/browse/DRILL-7690
 Project: Apache Drill
  Issue Type: Improvement
  Components: Web Server
Affects Versions: 1.17.0
Reporter: Paul Rogers


Run a query in the Drill Web Console. View the profile, Query tab. Scroll down 
to the list of fragments. You'll see a gray bar with a title such as

Major Fragment: 02-xx-xx

This section shows the timing of the fragments.

But, what is happening in this fragment? To find out we must scroll way down to 
the lower section where we see:


02-xx-00 - SINGLE_SENDER
02-xx-01 - SELECTION_VECTOR_REMOVER
02-xx-02 - LIMIT
02-xx-03 - SELECTION_VECTOR_REMOVER
02-xx-04 - TOP_N_SORT
02-xx-05 - UNORDERED_RECEIVER

The result is quite a bit of scroll down/scroll up.

This ticket asks to show the major operators in the fragment title. For 
example, for the above:

Major Fragment: 02-xx-xx (TOP_N_SORT, LIMIT)

The "minor" operators which are omitted (because they are not the focus of the 
fragment) include senders, receivers and the SVR.

Note that the operators should appear in data flow order (bottom to top).





[jira] [Created] (DRILL-7689) Do not save profiles for trivial queries

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7689:
--

 Summary: Do not save profiles for trivial queries
 Key: DRILL-7689
 URL: https://issues.apache.org/jira/browse/DRILL-7689
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Drill saves a query profile for every query. Some queries are trivial; there is 
no useful information (for the user) in such queries. Examples include {{ALTER 
SESSION/SYSTEM}}, {{CREATE SCHEMA}}, and other internal commands.

Logic already exists to omit profiles for {{ALTER}} commands, but only if a 
session option is set. No ability exists to omit profiles for the other 
statements.

This ticket asks to:
 * Omit profiles for trivial commands by default. (Part of the task is to 
define the set of trivial commands.)
 * Provide an option to enable such profiles, primarily for use by developers 
when debugging the trivial commands.
 * If no profile is available, show a message to that effect in the Web UI 
where we currently display the profile number. Provide a link to the 
documentation page that explains why there is no profile (and how to use the 
above option to request a profile if needed.)





[jira] [Created] (DRILL-7688) Provide web console option to see non-default options

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7688:
--

 Summary: Provide web console option to see non-default options
 Key: DRILL-7688
 URL: https://issues.apache.org/jira/browse/DRILL-7688
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


The Drill web console has evolved to become quite powerful. The Options page 
has many wonderful improvements over earlier versions. The "Default" button is 
a handy way to see which options have been set, and to reset options to their 
default values.

When testing and troubleshooting, it is helpful to identify those options which 
are not at their default values. Please add a filter at the top of the page for 
"non-default" in addition to the existing topic-based filters.

It may also be useful to add a bit more color to the "Default" button when an 
option is set. At present, the distinction is gray vs. black text, which is 
better than it was. It would be better to have even more contrast so that 
non-default values are easier to see.





[jira] [Created] (DRILL-7687) Inaccurate memory estimates in hash join

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7687:
--

 Summary: Inaccurate memory estimates in hash join
 Key: DRILL-7687
 URL: https://issues.apache.org/jira/browse/DRILL-7687
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.15.0
Reporter: Paul Rogers


See DRILL-7675. In this ticket, we tried to reproduce an OOM case in the 
partition sender. In so doing, we mucked with various parallelization options. 
The query has 2 MB of data, but at one point the query would fail to run 
because the hash join could not obtain enough memory (on a system with 8 GB of 
memory available.)

The problem is that the memory calculator sees a worst-case scenario: a row 
with 250+ columns. The hash join estimated it needed something like 650MB of 
memory to perform the join. (That is 650 MB per fragment, and there were 
multiple fragments.) Since there was insufficient memory, and the 
{{drill.exec.hashjoin.fallback.enabled}} option was disabled, the hash join 
failed before it even started.

Better would be to at least try the query. In this case, with 2MB of data, the 
query succeeds. (Had to enable the magic option to do so.)

Better also would be to use the estimated row counts when estimating memory 
use. Maybe better estimates for the amount of memory needed per row. (The data 
in question has multiple nested map arrays, causing cardinality estimates to 
grow by 5x at each level.)

Perhaps use the "batch sizing" mechanism to detect actual memory use by 
analyzing the incoming batch.

There is no obvious answer. However, the goal is clear: the query should 
succeed if the actual memory needed fits within that available; we should not 
fail proactively based on estimates of needed memory. (This is what the 
{{drill.exec.hashjoin.fallback.enabled}} option does; perhaps it should be on 
by default.)





[jira] [Created] (DRILL-7686) Excessive memory use in partition sender

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7686:
--

 Summary: Excessive memory use in partition sender
 Key: DRILL-7686
 URL: https://issues.apache.org/jira/browse/DRILL-7686
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.14.0
Reporter: Paul Rogers


The Partition Sender in Drill is responsible for taking a batch from fragment x 
and sending its rows to all other fragments f1, f2, ... fn. For example, when 
joining, fragment x might read from a portion of a file, hash the join key, and 
partition rows by hash key to the receiving fragments that join rows with that 
same key.

Since Drill is columnar, the sender needs to send a batch of columns to each 
receiver. To be efficient, that batch should contain a reasonable number of 
rows. The current default is 1024.

Drill creates buffers in each sender to gather the rows. Thus, each sender 
needs n buffers: one for each receiver.

Because Drill is symmetrical, there are n senders (scans). Since each maintains 
n send buffers, we have a total of n^2 buffers. That is, the amount of memory 
used by the partition sender grows with the square of the degree of parallelism 
for a query.

In addition, as seen in DRILL-7675, the size of the buffers is controlled not 
by Drill, but by the incoming data. The query in DRILL-7675 had a row with 260+ 
fields, some of which were map arrays.

The result is that the query, which processes 2 MB of data, runs out of memory 
when many GB are available. Drill is simply doing the math: n^2 buffers, each 
with 1024 rows, each with 250 fields, many with a cardinality of 5x (or 25x or 
125x, depending on array depth) of the row count. The result is a very large 
memory footprint.
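
To make the math concrete, here is a back-of-the-envelope sketch; the constants 
are illustrative stand-ins, not Drill's actual sizing logic:

{code:java}
// Illustrative only: rough footprint of the n^2 buffer design described above.
public class PartitionSenderFootprint {
  public static void main(String[] args) {
    int n = 20;               // minor fragments: n senders, each with n buffers
    int rowsPerBatch = 1024;  // default sender batch size
    int fields = 250;         // wide row, as in DRILL-7675
    int bytesPerValue = 8;    // assume simple 8-byte values
    int cardinality = 5;      // assumed average array cardinality

    long perBuffer = (long) rowsPerBatch * fields * bytesPerValue * cardinality;
    long total = (long) n * n * perBuffer;
    System.out.printf("%d buffers, ~%d MB total%n", n * n, total >> 20);
    // Prints: 400 buffers, ~3906 MB total -- for 2 MB of input data.
  }
}
{code}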

There is no simple bug-fix solution: the design is inherently unbounded. This 
ticket asks to develop a new design. Some crude ideas:
 * Use a row-based format for sending to avoid columnar overhead.
 * Send rows as soon as they are available on the sender side; allow the 
receiver to do buffering.
 * If doing buffering, flush rows after x ms to avoid slowing the system. (The 
current approach waits for buffers to fill.)
 * Consolidate buffers on each sending node. (This is the Mux/DeMux approach 
which is in the code, but was never well understood, and has its own 
concurrency and memory-ownership problems.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7683) Add "message parsing" to new JSON loader

2020-03-31 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7683:
--

 Summary: Add "message parsing" to new JSON loader
 Key: DRILL-7683
 URL: https://issues.apache.org/jira/browse/DRILL-7683
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Worked on a project that uses the new JSON loader to parse a REST response that 
includes a set of "wrapper" fields around the JSON payload. Example:

{code:json}
{ "status": "ok", "results: [ data here ]}
{code}

To solve this cleanly, added the ability to specify a "message parser" to 
consume JSON tokens up to the start of the data. This parser can be written as 
needed for each different data source.

Since this change adds one more parameter to the JSON structure parser, added 
builders to gather the needed parameters rather than making the constructor 
even larger.
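
A minimal sketch of the idea, written for the wrapper shown above; the 
interface and its method are illustrative of the approach, and only the 
Jackson types are real:

{code:java}
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;

public interface MessageParser {
  // Consume tokens up to the start of the data; return false if none found.
  boolean parsePrologue(JsonParser parser) throws IOException;
}

class ResultsWrapperParser implements MessageParser {
  @Override
  public boolean parsePrologue(JsonParser parser) throws IOException {
    JsonToken token;
    while ((token = parser.nextToken()) != null) {
      if (token == JsonToken.FIELD_NAME
          && "results".equals(parser.getCurrentName())
          && parser.nextToken() == JsonToken.START_ARRAY) {
        return true; // positioned at the start of the data array
      }
    }
    return false; // EOF before any payload
  }
}
{code}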



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7680) Move UDF projects before plugins in contrib

2020-03-31 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7680:
--

 Summary: Move UDF projects before plugins in contrib
 Key: DRILL-7680
 URL: https://issues.apache.org/jira/browse/DRILL-7680
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Several {{contrib}} plugins depend on UDFs for testing. However, the UDFs occur 
after the plugins in build order. This PR reverses the dependencies so that 
UDFs are built before the plugins that want to use them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7658) Vector allocateNew() has poor error reporting

2020-03-24 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7658:
--

 Summary: Vector allocateNew() has poor error reporting
 Key: DRILL-7658
 URL: https://issues.apache.org/jira/browse/DRILL-7658
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


See posting by Charles on 2020-03-24 on the user and dev lists of a message 
forwarded from another user where a query ran out of memory. Stack trace:

{noformat}
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: null
    at 
org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.
{noformat}

Notice the complete lack of context. The method in question:

{code:java}
  public void allocateNew() throws OutOfMemoryException {
    if (!allocateNewSafe()) {
      throw new OutOfMemoryException();
    }
  }
{code}

A generated implementation of the {{allocateNewSafe()}} method:

{code:java}
  @Override
  public boolean allocateNewSafe() {
long curAllocationSize = allocationSizeInBytes;
if (allocationMonitor > 10) {
  curAllocationSize = Math.max(8, curAllocationSize / 2);
  allocationMonitor = 0;
} else if (allocationMonitor < -2) {
  curAllocationSize = allocationSizeInBytes * 2L;
  allocationMonitor = 0;
}

try{
  allocateBytes(curAllocationSize);
} catch (DrillRuntimeException ex) {
  return false;
}
return true;
  }
{code}

Note that the {{allocateNew()}} method is not "safe" (it throws an exception), 
but it does so by discarding the underlying exception. What should happen is 
that the "non-safe" {{allocateNew()}} should call the {{allocateBytes()}} 
method and simply forward the {{DrillRuntimeException}}. It probably does not 
do so because the author wanted to reuse the extra size calcs in 
{{allocateNewSafe()}}.

The solution is to put the calcs and the call to {{allocateBytes()}} in a 
"non-safe" method, and call that entire method from {{allocateNew()}} and 
{{allocateNewSafe()}}.  Or, better, generate {{allocateNew()}} using the above 
code, but have the base class define {{allocateNewSafe()}} as a wrapper.

Note an extra complexity: although the base class provides the method shown 
above, each generated vector also provides:

{code:java}
  @Override
  public void allocateNew() {
if (!allocateNewSafe()) {
  throw new OutOfMemoryException("Failure while allocating buffer.");
}
  }
{code}

This is both redundant and inconsistent (one has a message, the other does 
not.)
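
A hedged sketch of the first option above; {{allocateBytes()}}, 
{{allocationSizeInBytes}} and {{allocationMonitor}} come from the snippets 
above, while the {{(message, cause)}} constructor on {{OutOfMemoryException}} 
is an assumption:

{code:java}
// Sketch only. Compute the size once, then let allocateNew() forward the
// underlying failure while allocateNewSafe() wraps it.
private long computeAllocationSize() {
  long curAllocationSize = allocationSizeInBytes;
  if (allocationMonitor > 10) {
    curAllocationSize = Math.max(8, curAllocationSize / 2);
    allocationMonitor = 0;
  } else if (allocationMonitor < -2) {
    curAllocationSize = allocationSizeInBytes * 2L;
    allocationMonitor = 0;
  }
  return curAllocationSize;
}

@Override
public void allocateNew() {
  try {
    allocateBytes(computeAllocationSize());
  } catch (DrillRuntimeException ex) {
    // Keep the cause instead of discarding it.
    throw new OutOfMemoryException("Failure while allocating buffer", ex);
  }
}

@Override
public boolean allocateNewSafe() {
  try {
    allocateNew();
    return true;
  } catch (OutOfMemoryException ex) {
    return false;
  }
}
{code}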



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7640) EVF-based JSON Loader

2020-03-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7640:
--

 Summary: EVF-based JSON Loader
 Key: DRILL-7640
 URL: https://issues.apache.org/jira/browse/DRILL-7640
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Builds on the JSON structure parser and several other PRs to provide an 
enhanced, robust mechanism to read JSON data into value vectors via the EVF. 
This is not the JSON reader; rather, it is the "V2" version of the 
{{JsonProcessor}}, which does the actual JSON parsing/loading work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7634) Rollup of code cleanup changes

2020-03-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7634:
--

 Summary: Rollup of code cleanup changes
 Key: DRILL-7634
 URL: https://issues.apache.org/jira/browse/DRILL-7634
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Pack of cosmetic code cleanup changes accumulated over recent months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7633) Fixes for union and repeated list accessors

2020-03-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7633:
--

 Summary: Fixes for union and repeated list accessors
 Key: DRILL-7633
 URL: https://issues.apache.org/jira/browse/DRILL-7633
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Minor fixes for repeated list and Union type support in column accessors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7632) Improve user exception formatting

2020-03-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7632:
--

 Summary: Improve user exception formatting
 Key: DRILL-7632
 URL: https://issues.apache.org/jira/browse/DRILL-7632
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Modify the user exception to insert a colon between the "context" title and 
value. Old style:

{noformat}
My Context value
{noformat}

Revised:

{noformat}
My Context: value
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7631) Updates to the Json Structure Parser

2020-03-09 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7631:
--

 Summary: Updates to the Json Structure Parser
 Key: DRILL-7631
 URL: https://issues.apache.org/jira/browse/DRILL-7631
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Updates to the JSON structure parser based on using it to create a revised JSON 
record reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7613) Revise, harden the preliminary storage plugin upgrade facility

2020-02-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7613:
--

 Summary: Revise, harden the preliminary storage plugin upgrade 
facility
 Key: DRILL-7613
 URL: https://issues.apache.org/jira/browse/DRILL-7613
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Drill provides a way to upgrade storage plugins after installing a new Drill 
version. However the mechanism is very crude and error prone. It is based on 
overwriting the current contents of the persistent store with a new set of 
configs. However, doing so is likely to discard essential user config.

For example, we cannot upgrade format plugins individually. So, to add a new 
format plugin, we must overwrite, say, the {{dfs}} config. In so doing we may 
throw away the user's S3 or HDFS config.

Further, we don't want to reapply the upgrades on every restart. So, the 
mechanism has the ability to delete a file to mark the system as upgraded. 
There are several problems with this idea. First, any such file is likely to be 
in a jar as a resource, so is not deletable except in an IDE (when we would 
really *not* want to delete it.)

Suppose the user does an upgrade, suffers a ZK loss, and restores from backup. 
Drill will not know to re-upgrade ZK because the file is gone (assuming that 
the delete actually worked.) This shows that using a file to indicate ZK state 
is a poor implementation choice.

The code does not handle race conditions. If we bring up a cluster of 10 
Drillbits, all 10 will race to perform upgrades.

The code includes partial support for upgrading format plugins, but that code does not 
actually work, as there is no good way to do that upgrade. (Each DFS storage 
plugin has its own set of format plugins, unfortunately, and there is no code 
to find all such DFS storage plugins and apply format plugin updates.)

A better solution would be to:

* Store a version in the plugin registry. Upgrade only if the version in the 
registry is lower than the current Drillbit version.
* Better, provide a SQL command to force the upgrade. This allows users to do 
rolling v-1/v upgrades without the v-1 Drillbits seeing plugins that they 
cannot handle.
* Implement a race-condition-proof upgrade: select one Drillbit to do the 
upgrade and let the others wait. (Leader election.)
* Separate format plugins from the DFS storage plugin. Allow the same formats 
to be used across all configured DFS plugins. (There is no harm in offering a 
format for non-existent files.)
* Complete the work to upgrade the (now separate) format plugins so we can 
automatically roll out new formats without users having to do the upgrade 
manually.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7612) Make the ExcelFormatConfig immutable

2020-02-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7612:
--

 Summary: Modify the ExcelFormatConfig immutable
 Key: DRILL-7612
 URL: https://issues.apache.org/jira/browse/DRILL-7612
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


See note in the {{ExcelFormatConfig}} class. The class is designed with mutable 
fields. However, if any of the fields are actually changed while the excel 
reader is active, the result is undefined.

It would be better for the class to be immutable, with {{final}} fields.
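
A minimal sketch of the immutable form, assuming Jackson-based construction; 
the fields shown are illustrative, not the actual {{ExcelFormatConfig}} fields:

{code:java}
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

@JsonTypeName("excel")
public class ExcelFormatConfig implements FormatPluginConfig {
  private final boolean extractHeader; // illustrative field
  private final String sheetName;      // illustrative field

  @JsonCreator
  public ExcelFormatConfig(
      @JsonProperty("extractHeader") boolean extractHeader,
      @JsonProperty("sheetName") String sheetName) {
    this.extractHeader = extractHeader;
    this.sheetName = sheetName;
  }

  public boolean getExtractHeader() { return extractHeader; }
  public String getSheetName() { return sheetName; }
}
{code}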



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7611) Minor improvements to the Option Manager

2020-02-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7611:
--

 Summary: Minor improvements to the Option Manager
 Key: DRILL-7611
 URL: https://issues.apache.org/jira/browse/DRILL-7611
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Based on a recent contribution in  DRILL-7603, #1996 suggests some minor 
improvements to the option manager:

* Provide a `getInt(String name)` method that verifies that a `long` is in the 
`int` range and does the cast. Avoids casts in each client location.
* Provide a `setOption(String name, Object value)` method in 
`BaseOptionManager`.
* Then, build on that with a `setOptions()` method.
* Rename the `setLocalOption` methods to `set()` (the "local option" part 
is implied.)
* Rename `setLocalOption(String name, String value)` to `setString()`.
* Rename `void setLocalOption(String name, Object value)` to `setOption()`. 
(Avoids the String/Object ambiguity in the current methods.)

Then, we can reuse these methods in the several other places where we do such 
things, such as the code in DRILL-7603 and the `ClusterTest` code which mucks 
with options.
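
A hedged sketch of the first item, assuming the existing `getLong(String)` 
accessor on the option manager:

{code:java}
// Sketch only: range-check a long-valued option before narrowing to int.
public int getInt(String name) {
  long value = getLong(name);
  if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
    throw new IllegalArgumentException(String.format(
        "Option %s: value %d is outside the int range", name, value));
  }
  return (int) value;
}
{code}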




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7610) Allow user to specify table schema in Metastore

2020-02-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7610:
--

 Summary: Allow user to specify table schema in Metastore
 Key: DRILL-7610
 URL: https://issues.apache.org/jira/browse/DRILL-7610
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


The Drill Metastore will infer the schema of a table while gathering stats. 
Unfortunately, this creates a chicken-and-egg problem. Some files need the 
Metastore because the schema is ambiguous. Such data won't even scan correctly 
without such information. Classic JSON example:

{code:json}
{a: 10} {a: 10.1}
{code}

In these cases, the user should first define the table schema, then run the 
{{ANALYZE TABLE}} commands. In such cases, Drill should not attempt to change 
the type information (since the actual data is ambiguous.)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7609) Display query in Web UI results page

2020-02-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7609:
--

 Summary: Display query in Web UI results page
 Key: DRILL-7609
 URL: https://issues.apache.org/jira/browse/DRILL-7609
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Use the Drill Web console. Submit a query (such as the sample one.) You'll see 
the results page. Now, what query did you just run? The page does not display 
it, though it does display the (somewhat uninformative) query ID.

Suggestion: instead of the profile UI, show the query (or at least the first 
few lines.) Allow a single-click traversal back to the submit page for that 
query for the typical case that the first cut at the query had an error, is not 
quite right, etc.

Have another link (as now) to the query profile, which is typically used less 
often than editing the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7608) Web UI: Avoid wait UI for short queries

2020-02-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7608:
--

 Summary: Web UI: Avoid wait UI for short queries
 Key: DRILL-7608
 URL: https://issues.apache.org/jira/browse/DRILL-7608
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


A recent change to the web UI shows a nifty wait dialog after submitting a 
query. For queries that run in a fraction of a second, the dialog pops and 
disappears in an eyeblink, which is disconcerting. Maybe insert a small delay 
before showing the dialog.

Or, better, go to the results page and show the "waiting" dialog in place of 
results until results are ready.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7606) Support Hive client and JDBC APIs

2020-02-25 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7606:
--

 Summary: Support Hive client and JDBC APIs
 Key: DRILL-7606
 URL: https://issues.apache.org/jira/browse/DRILL-7606
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Both Hive and Impala implement the server-side protocol for the [Hive client 
API|https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC].
 Some internals documentation is 
[here|https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview]. 
[Thrift message 
definition|https://github.com/apache/hive/blob/master/service-rpc/if/TCLIService.thrift].
 Using the Hive client has a number of advantages:

* Maintained by the Hive and Impala projects, so we benefit from shared 
investment.
* Does not depend on Drill's internals (such as Netty, value vectors, direct 
memory allocation, etc.)
* Already supported by many tools.
* Comes with the "Beeline" command line tool (like SqlLine.)
* The API is versioned, allowing easier client upgrades than Drill's 
unversioned network API.
* Returns data in a row-oriented format better suited to JDBC clients than 
Drill's (potentially large, direct-memory based) value vectors.
* Passes session options along with each query to allow the server to be 
stateless and to allow round-robin distribution of requests to servers.

The Hive API may not be a perfect fit: Hive assumes the existence of a 
metastore such as HMS. Still, this may be a better option than trying to 
improve the existing API.

A pilot approach would be to implement a Thrift server (perhaps borrowing Hive 
code) that turns around and uses the Drill client API to talk to the Drill 
server. If this "shim" server proves the concept, the code can move into the 
Drillbit itself.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7601) Shift column conversion to reader from scan framework

2020-02-24 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7601:
--

 Summary: Shift column conversion to reader from scan framework
 Key: DRILL-7601
 URL: https://issues.apache.org/jira/browse/DRILL-7601
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


At the time we implemented provided schemas with the text reader, the best path 
forward appeared to be to perform column type conversions within the scan 
framework including deep in the column writer structure.

Experience with other readers has shown that the text reader is a special case: 
it always writes strings, which Drill-provided converters can parse into other 
types. Other readers, however, are not so simple: they often have their own 
source structures which must be mated to a column reader, and so conversion is 
generally best done in the reader where it can be specific to the nuances of 
each reader.

This ticket asks to restructure the conversion code to fit the 
reader-does-conversion pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7598) PostgreSQL-like functions for working with JSON

2020-02-23 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7598:
--

 Summary: PostgreSQL-like functions for working with JSON
 Key: DRILL-7598
 URL: https://issues.apache.org/jira/browse/DRILL-7598
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


From a contributor on the Drill user mailing list:
{quote}PostgreSQL has a practical way to manipulate the json data. You can 
read: [https://www.postgresql.org/docs/12/datatype-json.html].

{quote}
The user's use case is as follows:
{code:json}
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green" ,"c4":"vanilla"}}
{"a":"cow"  ,"b":"28","c":{"c1":"blue" ,"c3":"black" ,"c5":{"d":"2","e":"3"}}}
{code}

Notice that the {{`c`}} column changes types. This causes Drill to fail in 
execution. Hence the suggestion to work with column {{c}} as JSON without 
parsing that JSON into Drill's relational schema.

Drill should offer such support. We've recently discussed introducing a similar 
feature in Drill which one could, with some humor, call "let JSON be JSON." The 
idea would be, as in PostgreSQL, to simply represent JSON as text and allow the 
user to work with JSON using JSON-oriented functions. The PostgreSQL link 
suggests that this is, in fact, a workable approach (though, as noted, doing 
so is slower than converting JSON to a relational structure.)

Today, however, Drill attempts to map JSON into a relational model so that the 
user can use [SQL operations to work on the 
data|https://drill.apache.org/docs/json-data-model/]. The Drill approach works 
well when the JSON is the output of a relational model (a dump of a relational 
table or query, say.) The approach does not work for "native" JSON in all its 
complexity. JSON is a superset of the relational model and so not all JSON 
files map to tables and columns.

To solve the user's use case, Drill would need to adopt a solution similar to 
PostgreSQL. In fact, Drill already has some of the pieces (such as the 
[CONVERT_TO/CONVERT_FROM 
operations|https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from]),
 but even these attempt to convert JSON to or from the relational model. What 
we need, to solve the general use case, are the kind of native JSON functions 
which PostgreSQL provides.

Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed 
in the Drill "core". All that is needed is someone to provide a set of Drill 
functions (UDFs) to call out to some JSON library to perform the desired 
operations.
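
A hedged sketch of one such UDF, loosely modeled on PostgreSQL's {{->>}} 
operator. The UDF plumbing follows the documented Drill pattern (including 
fully-qualified names inside {{eval()}}); the Jackson-based body is 
illustrative only, not a proposed final design:

{code:java}
import javax.inject.Inject;
import io.netty.buffer.DrillBuf;
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.VarCharHolder;

@FunctionTemplate(name = "json_extract_text",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class JsonExtractText implements DrillSimpleFunc {
  @Param VarCharHolder jsonText;
  @Param VarCharHolder fieldName;
  @Output VarCharHolder out;
  @Inject DrillBuf buffer;

  public void setup() { }

  public void eval() {
    String doc = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
        .toStringFromUTF8(jsonText.start, jsonText.end, jsonText.buffer);
    String field = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
        .toStringFromUTF8(fieldName.start, fieldName.end, fieldName.buffer);
    String result = "";
    try {
      com.fasterxml.jackson.databind.JsonNode node =
          new com.fasterxml.jackson.databind.ObjectMapper()
              .readTree(doc).get(field);
      if (node != null) {
        // Return scalar text as-is; nested structure as JSON text.
        result = node.isTextual() ? node.asText() : node.toString();
      }
    } catch (java.io.IOException e) {
      // Fall through: return an empty string for unparseable input.
    }
    byte[] bytes = result.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    buffer = buffer.reallocIfNeeded(bytes.length);
    buffer.setBytes(0, bytes);
    out.buffer = buffer;
    out.start = 0;
    out.end = bytes.length;
  }
}
{code}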

This feature would work best when the user can parse some parts of a JSON input 
file into a relational structure, others as JSON. (This is the use case the 
user on the mailing list faced.) So, we need a way to do that. See DRILL-7597 for a 
request for such a feature.

Combining the PostgreSQL-like JSON functions with the ability to read selected 
columns as JSON, might provide an elegant solution to the "messy JSON" problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7597) Read selected JSON columns as JSON text

2020-02-23 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7597:
--

 Summary: Read selected JSON columns as JSON text
 Key: DRILL-7597
 URL: https://issues.apache.org/jira/browse/DRILL-7597
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


See . The use case wishes to read selected JSON columns as JSON text rather 
than parsing the JSON into a relational structure as is done today in the JSON 
reader.

The JSON reader supports "all text mode", but, despite the name, this mode only 
works for scalars (primitives) such as numbers. It does not work for structured 
types such as objects or arrays: such types are always parsed into Drill 
structures (which causes the conflict described in __.)

Instead, we need a feature to read an entire JSON value, including structure, 
as a JSON string.

This feature would work best when the user can parse some parts of a JSON input 
file into a relational structure, others as JSON. (This is the use case the 
user on the mailing list faced.) So, we need a way to do that.

Drill has a "provided schema" feature, which, at present, is used only for text 
files (and recently with limited support in Avro.) We are working on a project 
to add such support for JSON.

Perhaps we can leverage this feature to allow the JSON reader to read chunks of 
JSON as text which can be manipulated by those future JSON functions. In the 
example, column "c" would be read as JSON text; Drill would not attempt to 
parse it into a relational structure.

As it turns out, the "new" JSON reader we're working on originally had a 
feature to do just that, but we took it out because we were not sure it was 
needed. Sounds like we should restore it as part of our "provided schema" 
support. It could work this way: if you CREATE SCHEMA with column "c" as 
VARCHAR (maybe with a hint to read as JSON), the JSON parser would read the 
entire nested structure as JSON without trying to parse it into a relational 
structure.

This ticket asks to build the concept:

* Allow a `CREATE SCHEMA` option (to be designed) to designate a JSON field to 
be read as JSON.
* Implement the "read column as JSON" feature in the new EVF-based JSON reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7593) Standardize local paths

2020-02-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7593:
--

 Summary: Standardize local paths
 Key: DRILL-7593
 URL: https://issues.apache.org/jira/browse/DRILL-7593
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Discovered in the context of DRILL-7589 (PR #1987) is the idea of standardizing 
our set of local file system paths used when Drill runs in embedded mode. There 
may also be an opportunity to unify local file system paths used in distributed 
mode.

In distributed mode, we use ZK for distribution: all shared data must be in a 
location visible to all Drillbits: either ZK or a DFS. There is some need for 
local storage such as for UDF staging and for spill files.

In local mode, all persistent storage occurs on the local file system; there is 
no ZK and there is no need to coordinate a set of Drillbits.

At present, the local paths are spread all over the config system. Code that 
wants to set up local paths (such as {{DirTestWatcher}}) must handle each 
directory specially. Then, either {{ClusterFixture}} or a unit test must set 
the proper config property to match the directory selection.

For example, from {{drill-module.conf}}:

{noformat}
drill.tmp-dir: "/tmp"
drill.tmp-dir: ${?DRILL_TMP_DIR}
...
sys.store.provider: {
  local: {
    path: "/tmp/drill",
  }
  trace: {
    directory: "/tmp/drill-trace",
    filesystem: "file:///"
  },
  tmp: {
    directories: ["/tmp/drill"],
    filesystem: "drill-local:///"
  },
  compile: {
    code_dir: "/tmp/drill/codegen"
...
  spill: {
    // *** Options common to all the operators that may spill
    // File system to use. Local file system by default.
    fs: "file:///",
    // List of directories to use. Directories are created
    // if they do not exist.
    directories: [ "/tmp/drill/spill" ]
...
  udf: {
    directory: {
      // Base directory for remote and local udf directories, unique among clusters.
{noformat}

And probably more. To move where Drill stores temp files, the user must change 
all of these properties.

Fortunately, [~arina] did a nice job with the UDF directories: they all are 
computed from the base directory:

{noformat}
directory: {
  // Base directory for remote and local udf directories, unique among clusters.
  base: ${drill.exec.zk.root}"/udf",

  // Path to local udf directory, always created on local file system.
  // Root for these directories is generated at runtime unless Drill temporary
  // directory is set.
  local: ${drill.exec.udf.directory.base}"/udf/local",

  // Set this property if custom file system should be used to create remote
  // directories, ex: fs: "file:///".
  // fs: "",
  // Set this property if custom absolute root should be used for remote
  // directories, ex: root: "/app/drill".
  // root: "",

  // Relative path to all remote udf directories.
  // Directories are created under default file system taken from Hadoop
  // configuration unless ${drill.exec.udf.directory.fs} is set.
  // User home directory is used as root unless ${drill.exec.udf.directory.root}
  // is set.
  staging: ${drill.exec.udf.directory.base}"/staging",
  registry: ${drill.exec.udf.directory.base}"/registry",
  tmp: ${drill.exec.udf.directory.base}"/tmp"
}
{noformat}

So, can we do the same thing for all the other local directories? Allow each to 
be custom-set, but default them to be computed from a single base directory. 
That way, if a unit test or install wants to move the Drill local directories 
to, say, {{/var/drill/tmp}}, they only need change a single config line and 
everything else follows automatically.
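
For example, a hedged sketch of what the unified config might look like; the 
property names below are invented for illustration, not existing Drill config 
keys:

{noformat}
drill.exec.local: {
  // Single knob: move this and everything below follows.
  base: "/tmp/drill",
  base: ${?DRILL_LOCAL_DIR},

  spill: ${drill.exec.local.base}"/spill",
  codegen: ${drill.exec.local.base}"/codegen",
  store: ${drill.exec.local.base}"/store",
  trace: ${drill.exec.local.base}"/trace"
}
{noformat}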

This can be done in the existing conf file as was done for UDFs. And, I guess 
to preserve compatibility, we'd have to leave the properties where they are; 
we'd just change their values.

This ticket asks to:

* Work out a good solution.
* Implement it in the config system
* Scrub the unit tests and {{DirTestWatcher}} to determine where we can 
simplify code by reusing this solution rather than ad-hoc, per directory 
configs.
* Modify {{DirTestWatcher}} to coordinate with the config system: Set the base 
directory in config, then use the configured paths for each of the persistent 
store, profile, UDF and other directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7590) Refactor plugin registry

2020-02-17 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7590:
--

 Summary: Refactor plugin registry
 Key: DRILL-7590
 URL: https://issues.apache.org/jira/browse/DRILL-7590
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


The plugin registry connects configurations, stored in ZK, with 
implementations, which are Java classes. The registry handles a large number of 
tasks including:

* Populating "bootstrap" plugin configurations and handling upgrades.
* Reading from, and writing to, the persistent store in ZK.
* Handling "normal" (configured) plugins and special system plugins (which have 
no configuration.)
* Handling format plugins, which are always associated with the DFS storage plugin.
* And so on.

The code has grown overly complex. As we look to add a new, cleaner plugin 
mechanism, we will start by cleaning up what we have to allow the new mechanism 
to be one of many.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7583) Remove STOP status in favor of fail-fast

2020-02-13 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7583:
--

 Summary: Remove STOP status in favor of fail-fast
 Key: DRILL-7583
 URL: https://issues.apache.org/jira/browse/DRILL-7583
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


The original error solution was a complex process of a) setting a failed flag, 
b) telling all upstream operators they have failed, c) returning a {{STOP}} 
status.  Drill has long supported a "fail-fast" error path based on throwing an 
exception; relying on the fragment executor to clean up the operator stack. 
Recent revisions have converted most operators to use the simpler fail-fast 
strategy based on throwing an exception instead of using the older {{STOP}} 
approach. This change simply removes the old, now-unused {{STOP}} based path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7576) Fail fast in operators

2020-02-08 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7576:
--

 Summary: Fail fast in operators
 Key: DRILL-7576
 URL: https://issues.apache.org/jira/browse/DRILL-7576
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Continues the work to clean up operators to "fail fast" by throwing an 
exception instead of using the more involved STOP status.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7574) Generalize projection parser

2020-02-07 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7574:
--

 Summary: Generalize projection parser
 Key: DRILL-7574
 URL: https://issues.apache.org/jira/browse/DRILL-7574
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


EVF contains a bit of code called the "projection parser": it takes a projection 
list and converts it into a form useful for scan projection. The prior version 
handled single-level arrays, such as needed for the {{`columns`}} column in the 
text reader. For JSON, we must handle arbitrary column structures such as:

{noformat}
a, a.b, a[1], a[1][2], a.[1][2].b
{noformat}

Adding the DICT type means that we must be a bit more general in the parser. 
This ticket fixes these issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7572) JSON structure parser

2020-02-06 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7572:
--

 Summary: JSON structure parser
 Key: DRILL-7572
 URL: https://issues.apache.org/jira/browse/DRILL-7572
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


DRILL-6953, PR #1913 provides an EVF-version of the JSON reader. The original 
plan was to commit a first draft, with the feature disabled, then evolve from 
there. Review comments suggested we ensure that the first PR pass all unit 
tests with the feature enabled. Another comment suggested the JSON parser 
portion be reusable for other format or storage plugins. All good ideas.

The amount of change in the original PR was becoming too large. So, we'll go 
ahead and split the work into smaller portions.

This ticket is for the "JSON structure parser." This parser accepts the token 
stream from the Jackson parser and provides a framework to create a tree 
structure that describes the data.  To keep the code simple, to simplify tests, 
and to make review easier, the structure parser does not directly integrate 
with the rest of Drill. Instead, it provides a number of listener interfaces 
which the next level implements. (The next level will be in another PR.)
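
A rough sketch of the listener shape; the interfaces below illustrate the 
approach rather than the committed API, and only the Jackson types are real:

{code:java}
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

// The structure parser walks Jackson's token stream and calls back into
// listeners supplied by the next level (the vector-writing layer).
public interface ObjectListener {
  ValueListener onField(String key); // a new field appeared in this object
  void onEnd();                      // the enclosing object closed
}

public interface ValueListener {
  void onScalar(JsonToken token, JsonParser parser); // leaf value
  ObjectListener onObject();                         // value is a nested object
  ArrayListener onArray();                           // value is an array
}

public interface ArrayListener {
  ValueListener onElement(); // listener for each array element
  void onEnd();              // the array closed
}
{code}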




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7567) Metastore enhancements

2020-02-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7567:
--

 Summary: Metastore enhancements
 Key: DRILL-7567
 URL: https://issues.apache.org/jira/browse/DRILL-7567
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


The Metastore feature shipped as a Beta. Review of the documentation identified 
a number of opportunities for improvement before the feature leaves Beta.

* Should the Metastore be configured in its own file? Does this push us in the 
direction of each feature having its own set of config files? Or, should config 
move into the normal Drill config files?
* Provide a detailed schema and description of Metadata entities, like the Hive 
metadata schema.
* Provide an out-of-the-box sample Metastore for some of Drills demo tables.
* Provide a Metastore tutorial. Refer to the sample Metastore in the tutorial. 
Many folks learn best by trying things hands-on.
* Solve read/write consistency issues to avoid the need for the error/recovery 
described for {{metastore.metadata.fallback_to_file_metadata}}.
* Boot-time config is stored in the {{drill.metastore}} namespace. But, 
Metastore SYSTEM/SESSION options are in the {{drill.exec}} namespace. This is 
confusing. Let's be consistent.
* {{drill.exec.storage.implicit.last_modified_time.column.label}} is a bug: 
Drill internal names should never conflict with user-defined column names. 
Figure out where they conflict and fix the issue. No user can ever guarantee that some 
name will never be used in their tables. Nor can users easily fix the issue if 
it occurs. (Note: this is a flaw with our implicit columns as well.)
* Provide a form of ANALYZE TABLE that automatically reuses settings from any 
previous run. It will otherwise be very user unfriendly for the user to have to 
find a place to store the ANALYZE TABLE command so that they can submit exactly 
the same one each time. In fact, experience with Impala suggests that end users 
will have no idea about schema, they just want the latest metadata. Such users 
won't even know the details of a command some other user might have submitted.
* The Iceberg metastore requires atomic rename. But, the most common use case 
for Drill today is the cloud. S3 does not support atomic rename. We need to fix 
this.
* The documentation says we use the "plugin name" as part of the table key. But, 
for DFS, say, the user can have dozens of plugin configs, each with a distinct 
name. Each can reuse the same workspace name of, say "foo". Thus "dfs/foo" is 
ambiguous. But, "hdfs1/foo", and "local/foo" are unique if we use storage 
plugin config names.
* It is not clear if the Iceberg metastore supports HDFS security and Kerberos 
tickets. If not, then it won't work in a production deployment.
* The metastore is meant to store schema. A key use is when schema is 
ambiguous. But, metastore gathers schema the same way that Drill queries 
tables. If schema is ambiguous, the ANALYZE TABLE will fail. Thus we do not 
actually solve the ambiguous schema problem. We need a solution.
* Better partition support. Drill has a long-standing usability issue that 
users must do their own partition coding. If I want data from 2018-11 to 
2019-02 (one quarter worth of data), I have to write the very ugly

{code:sql}
WHERE (dir0 = 2018 AND dir1 >= 11)
OR (dir0 = 2019 AND dir1 <= 1)
{code}

With Hive/Impala/Presto I can just write:

{code:sql}
WHERE transDate BETWEEN '2018-11-01' AND '2019-01-31'
{code}
* Allow staged gathering of stats. Allow me to first gather stats and review 
them for quality before I have my users start using them. As it is, there is no 
ability to gather them, enable the option for a session for testing, verify 
that things work right, then turn it on for everyone. That is, in a shared 
system, all heck can break loose in the current implementation.
* Review the internal Metastore tables. See many comments about the structure 
in the Metastore documentation PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7564) Revisit documentation structure

2020-02-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7564:
--

 Summary: Revisit documentation structure
 Key: DRILL-7564
 URL: https://issues.apache.org/jira/browse/DRILL-7564
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


This is a placeholder JIRA to track overall modifications to Drill 
documentation.

Drill's online documentation is one of the best for an open source project. The 
original authors documented complex topics starting from nothing other than 
what developers could provide.

As time has gone on, the earlier efforts have put us in a position where we can 
now reflect on how people actually use the documentation, and have the luxury 
of features that are not mature. This gives us an opportunity to review and 
revise the structure based on experience.

We can also learn from some of the best-structured documentation, such as Python 
and PHP. Drill's documentation will be maintained by volunteers, so we may not 
be able to follow the patterns of commercial tools such as MySQL.

Presented here is a proposed restructured outline. Creating such an outline 
takes time, so this ticket will evolve as we discuss and explore our options.

Perhaps documentation can be split into several top-level sections:

* Getting Started: Go from "what's Drill" to running queries on a single-node 
Drill cluster against local files, example data and easy-to-configure external 
systems.
* Drill for Users: "How-to" focused discussions of major features for people 
who want to use Drill to run queries against data sources.
* Drill for Operators: How to deploy Drill in production on bare metal, YARN, 
K8s and others. How to set up security and so on.
* Reference: SQL, plugin and other reference material.
* Drill for Developers: How to build Drill, how to create plugins and UDFs, etc.

In practice, these might just be groups of chapters with no top-level 
structure, but it is helpful here to divide up material by target audience.

Crude mapping of existing topics:

* Getting Started
  ** Getting Started / Drill Introduction
  ** Getting Started / Why Drill
  ** Install Drill (local install)
  ** Tutorials (basics)
  ** Configure Drill (local options)
* Drill for Users
 ** Tutorials (beyond the basics)
 ** Connect a Data Source
  ** ODBC/JDBC Interfaces
  ** Query Data
  ** Ecosystem
* Drill for Operators
  ** Drill on Bare Metal
*** Install Drill (cluster install)
  ** Drill-on-YARN
  ** Drill on K8s (when available)
  ** Configure Drill (cluster settings, security, etc.)
  ** Performance Tuning
  ** Log and Debug
  ** Troubleshooting
* Reference
  ** Release Notes
  ** SQL Reference
  ** Data Sources and File Formats
  ** Sample Datasets
* Drill for Developers
  ** Project Bylaws
  ** Developer Information
  ** Develop Custom Functions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7563) Docker & Kubernetes Drill server container

2020-02-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7563:
--

 Summary: Docker & Kubernetes Drill server container
 Key: DRILL-7563
 URL: https://issues.apache.org/jira/browse/DRILL-7563
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill provides two Docker containers:

* [Build Drill from 
sources|https://github.com/apache/drill/blob/master/Dockerfile]
* [Run Drill in interactive embedded 
mode|https://github.com/apache/drill/blob/master/distribution/Dockerfile]

User feedback suggests that these are not quite the right solutions to run 
Drill in a K8s (or OpenShift) cluster. In addition, we need a container to run 
a Drill server. This ticket summarizes the tasks involved.

h4. Container Image

The container image should:

* Start with the OpenJDK base image with minimal extra packages.
* Download and install an official Drill release.

We may then want to provide two derived images:

The Drillbit image which:

* Configures Drill for production and as needed in the following steps.
* Provides entry points for the Drillbit and for Sqlline
* Exposes Drill's four ports
* Accepts as parameters things like the ZK host IP(s).

The Sqlline image, meant to be run in interactive mode (like the current 
embedded image) and which:

* Accepts as parameters the ZK host IP(s).

h4. Runtime Environment

Drill has very few dependencies, but it must have a running ZK.

* Start a [ZK container|https://hub.docker.com/_/zookeeper/].
* A place to store logs, which can be in the container by default, or stored on 
the host file system via a volume mount.
* Access to a data source, which can be configured via a storage plugin stored 
in ZK.
* Ensure graceful shutdown integration with the Docker shutdown mechanism.

h4. Running Drill in Docker

Users must run at least one Drillbit, and may run more. Users may want to run 
Sqlline.

* The Drillbit container requires, at a minimum, the IP address of the ZK 
instance(s).
* The Sqlline container requires only the ZK instances, from which it can find 
the Drillbit.

Users will want to customize some parts of Drill: at least memory, perhaps any 
of the other options. Provide a way to pass this information into the container 
to avoid the need to rebuild the container to change configuration.

h4. Running Drill in K8s

The containers should be usable in "plain" Docker. Today, however, many people 
use K8s to orchestrate Docker. Thus, the Drillbit (but probably not the 
Sqlline) container should be designed to work with K8s. An example set of K8s 
YAML files should illustrate:

* Create a host-mount file system to capture Drill logs and query profiles.
* Optionally write Drill logs to stdout, to be captured by {{fluentd}} or 
similar tools.
* Pass Drill configuration (both HOCON and environment) as config maps.
* Pass ZK as an environment variable (the value of which would, one presumes, 
come from some kind of service discovery system.)

The result is that the user should be able to manually tinker with the YAML 
files, then use {{kubectl}} to launch, monitor and stop Drill. The user sets 
cluster size manually by launching the desired number of Drill pods.

h4. Helm Chart for Drill

The next step is to wrap the YAML files in a Helm chart, with parameters 
exposed for the config options noted above.

h4. Drill Operator for K8s
 
Full K8s integration will require an operator to manage the Drill cluster. K8s 
operators are often written in Go, though doing so is not necessary. Drill 
already includes Drill-on-YARN, which is essentially a "YARN operator." Repurpose 
this code to work with K8s as the target cluster manager rather than YARN. 
Reuse the same operations from DoY: configure, start, resize and stop a cluster.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7560) Free leaked memory after failed unit tests

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7560:
--

 Summary: Free leaked memory after failed unit tests
 Key: DRILL-7560
 URL: https://issues.apache.org/jira/browse/DRILL-7560
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Many unit tests work directly with vectors or record batches returned from 
queries. As a result, tests are responsible for correctly freeing memory held 
by vectors. Simple example:

{code:java}
  public void testDummyReader() throws Exception {
    RowSet results = client.queryBuilder()
        .sql("SELECT a, b, c from dummy.myTable")
        .rowSet();
    assertEquals(3, results.rowCount());
    results.clear();
  }
{code}

If the test fails, memory will be leaked and the test will report a failure on 
shut-down.

Now, in most cases, the test has already failed; there is no real harm in 
leaking memory because the problem will go away once the test is fixed and 
the assert passes.

Still, one could argue that the above is messy. Perhaps a cleaner solution is:

{code:java}
  public void testDummyReader() throws Exception {
    RowSet results = null;
    try {
      results = client.queryBuilder()
          .sql("SELECT a, b, c from dummy.myTable")
          .rowSet();
      assertEquals(3, results.rowCount());
    } finally {
      if (results != null) { results.clear(); }
    }
  }
{code}

Obviously, however, this is tedious and error-prone. Since these are tests, 
failure could occur anywhere, maybe even before the row set is returned to the 
test.

So, a more general solution is to walk the allocator tree and free any 
remaining resources. Call this when we shut down the test cluster, which JUnit 
will do whether the test succeeds or fails.
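
A heavily hedged sketch of that walk; {{getChildAllocators()}} and 
{{releaseOutstanding()}} are illustrative names for capabilities the allocator 
would need to expose, not existing Drill APIs:

{code:java}
import org.apache.drill.exec.memory.BufferAllocator;

// Sketch only: depth-first release of anything a failed test left behind.
public class TestMemorySweeper {
  public static void releaseLeaks(BufferAllocator allocator) {
    for (BufferAllocator child : allocator.getChildAllocators()) { // assumed API
      releaseLeaks(child);
    }
    if (allocator.getAllocatedMemory() > 0) {
      allocator.releaseOutstanding(); // assumed API: free remaining buffers
    }
  }
}
{code}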



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7559) Generalize provided schema handling for non-DFS plugins

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7559:
--

 Summary: Generalize provided schema handling for non-DFS plugins
 Key: DRILL-7559
 URL: https://issues.apache.org/jira/browse/DRILL-7559
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers


Drill offers a "provided schema" mechanism which is currently a work in 
progress. DRILL-7458 (base framework for storage plugins) shows how a custom scan can 
support a provided schema via a single line of code:

{code:java}
builder.typeConverterBuilder().providedSchema(subScan.getSchema());
{code}

The challenge, however, is how the plugin obtains the schema. At present, it is 
quite complex and ad-hoc:

* The plugin's schema factory would look up the  schema in some plugin-specific 
way.
* The schema would then be passed as part of the scan spec to the group scan.
* The group scan would pass the provided schema to the sub scan.
* The sub-scan carries the schema into the execution step so that, finally, the 
plugin can use the above line of code.

Needless to say, the developer experience is not quite a simple as it might be. 
In particular, the developer has to solve the complex problem of where to store 
the schema. DFS-based format plugins can use the existing file-based mechanism. 
Non-DFS plugins have no such choice.

So, the improvements we need are:

* Provide a reusable, shared schema registry that works even if Drill is not 
used with a DFS.
* Augment the SQL commands for creating a schema to use this new registry.
* Add the schema to the Base framework classes so it is automatically looked up 
from the registry, passed along the scan chain, and set on the reader builder 
at run time.

Note that we can probably leverage the work done for the metastore API. A 
metastore generally stores two kinds of data: 1) schema and 2) stats. Perhaps 
we can implement a DB-based version for non-DFS configurations.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7558) Generalize filter push-down planner phase

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7558:
--

 Summary: Generalize filter push-down planner phase
 Key: DRILL-7558
 URL: https://issues.apache.org/jira/browse/DRILL-7558
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


DRILL-7458 provides a base framework for storage plugins, including a 
simplified filter push-down mechanism. [~volodymyr] notes that it may be *too* 
simple:

{quote}
What about the case when this rule was applied for one filter, but planner at 
some point pushed another filter above the scan, for example, if we have such 
case:

{code}
Filter(a=2)
  Join(t1.b=t2.b, type=inner)
Filter(b=3)
Scan(t1)
Scan(t2)
{code}

Filter b=3 will be pushed into scan, planner will push filter above join:

{code}
Join(t1.b=t2.b, type=inner)
Filter(a=2)
Scan(t1, b=3)
Scan(t2)
{code}

In this case, check whether filter was pushed is not enough.
{quote}

Drill divides planning into a number of *phases*, each defined by a set of 
*rules*. Most storage plugins perform filter push-down during the physical 
planning stage. However, by this point, Drill has already decided on the degree 
of parallelism: it is too late to use filter push-down to set the degree of 
parallelism. Yet, if using something like a REST API, we want to use filters to 
help us shard the query (that is, to set the degree of parallelism.)
 
DRILL-7458 performs filter push-down at *logical* planning time to work around 
the above limitation. (In Drill, there are three different phases that could be 
considered the logical phase, depending on which planning options are set to 
control Calcite.)

[~volodymyr] points out that the logical plan phase may be wrong because it 
will perform rewrites of the type he cited.

Thus, we need to research where to insert filter push down. It must come:

* After rewrites of the kind described above.
* After join equivalence computations. (See DRILL-7556.)
* Before the decision is made about the number of minor fragments.

The goal of this ticket is to either:

* Research to identify an existing phase which satisfies these requirements, or
* Create a new phase.

Due to the way Calcite works, it is not a good idea to have a single phase 
handle two tasks that depend on one another. That is, we cannot combine filter 
push down in a phase which defines the filters, nor can we add filter push-down 
in a phase that chooses parallelism.

Background: Calcite is a rule-based query planner inspired by 
[Volcano|https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf].
The above issue is a flaw with rule-based planners and was identified as early 
as the [Cascades query framework 
paper|https://www.csd.uoc.gr/~hy460/pdf/CascadesFrameworkForQueryOptimization.pdf]
 which was the follow-up to Volcano.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7557) Revise "Base" storage plugin filter-push down listener with a builder

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7557:
--

 Summary: Revise "Base" storage plugin filter-push down listener 
with a builder
 Key: DRILL-7557
 URL: https://issues.apache.org/jira/browse/DRILL-7557
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


DRILL-7458 introduces a base framework for storage plugins and includes a 
simplified mechanism for filter push down. Part of that mechanism includes a 
"listener", with the bulk of the work done in a single method:

{code:java}
Pair> transform(GroupScan groupScan,
  List> andTerms, Pair 
orTerm);
{code}

Reviewers correctly pointed out that this method might be a bit too complex.

The listener pattern pretty much forced the present design. To improve it, we'd 
want to use a different design; maybe some kind of builder which might:

* Accept the CNF and DNF terms via dedicated methods.
* Perform a processing step.
* Provide a number of methods to communicate the results, such as 1) whether a 
new group scan is needed, 2) any CNF terms to retain, and 3) any DNF terms to 
retain.
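
A hedged sketch of what such a builder might look like; every name below is 
illustrative, and {{T}} stands in for whatever term representation the 
framework settles on:

{code:java}
import java.util.List;
import org.apache.drill.exec.physical.base.GroupScan;

// Sketch only: split the single transform() call into discrete steps.
public abstract class FilterPushDownBuilder<T> {
  public abstract void addAndTerm(T cnfTerm);        // accept CNF terms one at a time
  public abstract void addOrTerms(List<T> dnfTerms); // accept the DNF clause, if any

  public abstract void build();                      // perform the processing step

  public abstract boolean hasNewGroupScan();         // 1) is a new group scan needed?
  public abstract GroupScan newGroupScan();
  public abstract List<T> retainedAndTerms();        // 2) CNF terms Drill must still apply
  public abstract List<T> retainedOrTerms();         // 3) DNF terms Drill must still apply
}
{code}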




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7556) Generalize the "Base" storage plugin filter push down mechanism

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7556:
--

 Summary: Generalize the "Base" storage plugin filter push down 
mechanism
 Key: DRILL-7556
 URL: https://issues.apache.org/jira/browse/DRILL-7556
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


DRILL-7458 adds a Base framework for storage plugins which includes a 
simplified representation of filters that can be pushed down into Drill. It 
makes the assumption that plugins can generally only handle filters of the form:

{code}
column relop constant
{code}

For example, {{`foo` < 10}} or {{`bar` = "Fred"}}. (The code "flips" 
expressions of the form {{constant relop column}}.)

[~volodymyr] notes this is too narrow and suggests two additional cases:

{code}
column-expr relop constant
fn(column) = constant
{code}

Examples:

{code:sql}
foo + 10 = 20
substr(bar, 2, 6) = 'Fred'
{code}

The first case should be handled by a general expression rewriter: simplify 
constant expressions:

{code:sql}
foo + 10 = 20 --> foo = 10
{code}

Then, filter-push down need only handle the simplified expression rather than 
every push-down mechanism needing to do the simplification.

For this ticket, we wish to handle the second case: any expression that 
contains a single column associated with the target table. Provide a new 
push-down node to handle the non-relop case so that simple plugins can simply 
ignore such expressions, but more complex plugins (such as Parquet) can 
optionally handle them.

A second improvement is to handle the more complex case: two or more columns, 
all of which come from the same target table. For example:

{code:sql}
foo + bar = 20
{code}

Where both {{foo}} and {{bar}} are from the same table. It would be a very 
sophisticated plugin indeed (maybe the JDBC storage plugin) which can handle 
this case, but it should be available.

As part of this work, we must handle join-equivalent columns:

{code:sql}
SELECT ... FROM t1, t2
  WHERE t1.a = t2.b
  AND t1.a = 20
{code}

If the plugin for table {{t2}} can handle filter push-down, then the expression 
{{t1.a = 20}} is join-equivalent to {{t2.b = 20}}.

It is not clear if the Drill logical plan already handles join equivalence. If 
not, it should be added. If so, the filter push-down mechanism should add 
documentation that describes how the mechanism works.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7555) Standardize Jackson ObjectMapper usage

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7555:
--

 Summary: Standardize Jackson ObjectMapper usage
 Key: DRILL-7555
 URL: https://issues.apache.org/jira/browse/DRILL-7555
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Drill makes heavy use of Jackson to serialize Java objects to/from JSON. Drill 
has added multiple custom serializers; see the {{PhysicalPlanReader}} 
constructor for a list of these.

However, many modules in Drill declare their own {{ObjectMapper}} instances, 
often without some (or all) of the custom Drill mappers. This is tedious and 
error-prone.

We should:

* Define a standard Drill object mapper.
* Replace all ad-hoc instances of {{ObjectMapper}} with the Drill version (when 
reading/writing Drill-defined JSON).

Further, storage plugins need an {{ObjectMapper}} to convert a scan spec from 
JSON to Java. (It is not clear why we do this serialization, or if it is 
needed, but that is how things work at present.) Plugins don't have access to 
any of the "full-feature" object mappers: each plugin would have to cobble 
together the serdes it needs.

So, after standardizing the object mappers, pass in an instance of that 
standard mapper to the storage plugin.
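
As a rough sketch, the standard mapper might be exposed through a single holder 
class; the names here ({{DrillObjectMapper}}, the module contents) are 
hypothetical, not existing code:

{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.module.SimpleModule;

// Hypothetical holder for one shared, fully configured mapper.
public final class DrillObjectMapper {
  private static final ObjectMapper MAPPER = newMapper();

  private DrillObjectMapper() { }

  private static ObjectMapper newMapper() {
    ObjectMapper mapper = new ObjectMapper();
    SimpleModule drillModule = new SimpleModule("drill");
    // Register Drill's custom serializers here: the ones currently listed
    // in the PhysicalPlanReader constructor.
    mapper.registerModule(drillModule);
    return mapper;
  }

  public static ObjectMapper instance() { return MAPPER; }
}
{code}

Plugins (and the scan-spec serde) would then call {{DrillObjectMapper.instance()}} 
rather than constructing their own mappers.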



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7553) Modernize type management

2020-01-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7553:
--

 Summary: Modernize type management
 Key: DRILL-7553
 URL: https://issues.apache.org/jira/browse/DRILL-7553
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


This is a roll-up issue for our ongoing discussion around improving and 
modernizing Drill's runtime type system. At present, Drill approaches types 
vastly differently than most other DB and query tools:

 * Drill does little (or no) plan-time type checking and propagation. Instead, 
all type management is done at execution time, in each reader, in each 
operator, and ultimately in the client.
 * Drill allows structured types (Map, Dict, Arrays), but does not have the 
extended SQL statements to fully utilize these types.
* Drill supports varying types: two readers can both read column {{c}}, but can 
do so with different types. We've always hoped to discover some way to 
reconcile the types. But, at present, the functionality is buggy and 
incomplete. It is not clear that a viable solution exists. Drill also provides 
"formal" varying types: Union and List. These types are also not fully 
supported.

These three topics are closely related. "Schema-free" means we must infer types 
at read time, and so Drill cannot do plan-time type analysis of the kind done in 
other engines. Because of schema-on-read (which is what "schema-free" really 
means), two readers can read different types for the same fields, and so we end 
up with varying or inconsistent types, and are forced to figure out some way to 
manage the conflicts.

The gist of the proposal explored in this ticket is to exploit the learning 
from other engines: to embrace types when available, and to impose tractable 
rules when types are discovered at run time.

h4. Proposal Summary

This is very much a discussion draft. Here are some suggestions to get started.

# Set as our goal to manage types at plan time. Runtime type discovery becomes 
a (limited) special case.
# Pull type resolution, propagation and checking into the planner where it can 
be done once per query. Move it out of execution where it must be done multiple 
times: once per operator per minor fragment. Implement the standard DB type 
checking and propagation rules. (These rules are currently implicitly 
implemented deep in the code gen code.)
# Generate operator code in the planner; send it to workers as part of the 
physical plan (to avoid the need to generate the code on each worker.)
# Provide schema-aware extensions for storage and format plugins so that they 
can advertise a schema when known. (Examples: Hive sources get schemas from 
HMS, JDBC sources get schemas from the underlying database, Avro, Parquet and 
others obtain schemas from the target files, etc.) This mechanism works with, 
but is in addition to, the Drill metastore. 
# Separate the concepts of "schema-free" (no plan-time schema) from 
"schema-on-read" (schema is known in the planner, and data is read into that 
schema by readers; e.g. the Hive model.) Drill remains schema-on-read (for 
sources that need it), but does not attempt the impossible with schema-free 
(that is, we no longer read inconsistent data into a relational model and hope 
we can make it work.)
# For convenience, allow "schema-free" (no plan-time schema). The restriction 
is that all readers *must* produce the same schema. It is a fatal (to the query) 
error for an operator to receive batches with different schemas. (The reasons 
can be discussed separately.)
# Preserve the Map, Dict and Array types, but with tighter semantics: all 
elements must be of the same type.
# Replace the Union and List types with a new type: Java objects. Java objects 
can be anything and can vary from row-to-row. Java types are processed using 
UDFs (or Drill functions.)
# All "extended" types (complex: Map, Dict and Array, or Java objects) must be 
reduced to primitive types in a top-level tuple if the client is ODBC (which 
cannot handle non-relational types.) The same is true if the destination is a 
simple sink such as CSV or JDBC.
# Provide a light-weight way to resolve schema ambiguities that are identified 
by the new, stricter type rules. The light-weight solution is either a file or 
some kind of simple Drill-managed registry akin to the plugin registry. Users 
can run a query, see if there are conflicting types, and, if so, add a 
resolution rule to the registry. The user then reruns the query with a clean 
result.

In the past couple of years we have made progress in some of these areas. This 
ticket suggests we bring those threads together in a coherent strategy.

h4. Arrow/Java/Fixed Block/Something Else Storage

The ideas here are independent of choices we might make for our internal data 
representation format. The above design works equally well with either Drill or 
Arrow vectors, or with something else.

[jira] [Created] (DRILL-7545) Projection ambiguities in complex types

2020-01-21 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7545:
--

 Summary: Projection ambiguities in complex types
 Key: DRILL-7545
 URL: https://issues.apache.org/jira/browse/DRILL-7545
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


Summarized from an e-mail chain on the dev mailing list:

We recently introduced the DICT type. We also added the EVF framework. We have 
a bit of code which parses the projection list, then checks whether a column 
from a reader is consistent with the projection. The idea is to ensure that the 
columns produced by a Scan will be valid when a Project later tries to use them 
with the given project list. And, if the Scan says it can support 
Project-push-down, then the Scan is obligated to do the full check.

First, we'll explain how to solve the projection problem given that 
explanation. Then, we'll point out three potential ambiguities. Thanks to 
Bohdan for his explanations.

The problems here are not due to any one person. As explained below, they come 
from trying to add concepts to SQL that SQL is not well-suited to support.

h4. Projection for DICT Types

Queries go through two major steps: planning and execution. At the planning 
stage we use SQL syntax for the project list. For example:

{code:sql}
explain plan for SELECT a, e.`map`.`member`, `dict`['key'], `array`[10]  FROM 
cp.`employee.json` e
{code}

The planner sends an execution plan to operators. The project list appears in 
JSON. For the above:

{code:json}
   "columns" : [ "`a`", "`map`.`member`", "`dict`.`key`", "`array`[10]" ],
{code}

We see that the JSON works as Bohdan described:

* The SQL map "map.member" syntax is converted to "`map`.`member`" in the JSON 
plan.
* The SQL DICT "`dict`['key']" syntax is converted to a form identical to maps: 
"`dict`.`key`".
* The SQL DICT/array "`array`[10]" syntax is converted to "`array`[10]" in JSON.

That is, on the execution side, we can't tell the difference between a MAP and 
a DICT request. We also can't tell the difference between an Array and DICT 
request. Apparently, because of this, the Schema Path parser does not recognize 
DICT syntax.

Given the way projection works, "a.b" and "a['b']" are identical: either works 
for both a map or a DICT with VARCHAR keys. That is, we just say that map and 
array projection are both compatible with a DICT column?

h4. Projection Checking in Scan

Mentioned above is that a Scan that supports Project-push-down must ensure that 
the output columns match the projection list. Doing that check is quite easy 
when the projection is simple: `a`. The column `a` can match a data column `a` 
of any type.

The task is a bit harder when the projection is an array `a[0]`. Since this now 
means either an array or a DICT with an INT key, this projected column can match:

* Any REPEATED type
* A LIST
* A non-REPEATED DICT with INT, BIGINT, SMALLINT or TINYINT keys (ignoring the 
UINTx types)
* A REPEATED DICT with any type of key
* A UNION (because a union might contain a repeated type)

We can also handle a map projection: `a.b` which matches:

* A (possibly repeated) map
* A (possibly repeated) DICT with VARCHAR keys
* A UNION (because a union might contain a possibly-repeated map)
* A LIST (because the list can contain a union which might contain a 
possibly-repeated map)

Things get very complex indeed when we have multiple qualifiers such as 
`a[0][1].b` which matches:

* A LIST that contains a repeated map
* A REPEATED LIST that contains a (possibly-repeated) map
* A DICT with an INT key that has a value of a repeated map
* A REPEATED DICT that contains an INT key that contains a MAP
* (If we had sufficient metadata) A LIST that contains a REPEATED DICT with a 
VARCHAR key.
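
As a sketch of the kind of compatibility check the Scan must perform, here is 
the simplest array case (`a[0]`) using Drill's {{TypeProtos}} types; the helper 
class is hypothetical, and the DICT key-type check is stubbed out:

{code:java}
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.common.types.TypeProtos.MajorType;

// Hypothetical helper, not part of the current code.
public class ProjectionTypeChecker {

  // Can the projection item `a[0]` match a column of this type?
  public static boolean isArrayCompatible(MajorType type) {
    if (type.getMode() == DataMode.REPEATED) {
      return true;                 // any REPEATED type, including a repeated DICT
    }
    switch (type.getMinorType()) {
      case LIST:                   // a LIST
      case UNION:                  // a UNION might contain a repeated type
        return true;
      case DICT:                   // non-repeated DICT needs an INT-family key
        return hasIntegerKey(type);
      default:
        return false;
    }
  }

  private static boolean hasIntegerKey(MajorType type) {
    // Stub: a real check would inspect the DICT's key column for INT,
    // BIGINT, SMALLINT or TINYINT (ignoring the UINTx types).
    return true;
  }
}
{code}

The map case (`a.b`) and the multi-qualifier cases would need analogous, and 
progressively more involved, checks.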

h4. DICT Projection Ambiguities

The DICT type introduces an ambiguity. Note above that `a.b` can refer to 
either a REPEATED or non-REPEATED MAP. If non-repeated, `a.b` means to get the 
one value for member `b` of map `a`. But, if the map is REPEATED, this means to 
project an array of `b` values obtained from the array of maps.

For a DICT, there is an ambiguity with `a[0][1]` if the DICT is a repeated DICT 
of INT keys and REPEATED BIGINT values: that is, ARRAY<DICT<INT, 
ARRAY<BIGINT>>>. Does `a[0][1]` mean to pull out the 0th element of the 
REPEATED DICT, then look up where the key == 1? Or, does it mean to pull out all 
the DICT array values where the key == 0 and then pull out the 1st value of each 
BIGINT array? That is, because we have an implied (in all members of the array) 
syntax, one can interpret this case as:

{noformat}
repeatedDict[0].valueOf(1) --> ARRAY<BIGINT>
-- All the values in the key=1 array of element 0
{noformat}

or

{noformat}
repeatedDict.valueOf(0)[1] --> ARRAY<BIGINT>
-- All the values in the key=0, element 1 positions across all DICT elements
{noformat}

It would seem to make sense to prefer the first interpretation. Unfortunately, 
MAPs already use the 

[jira] [Created] (DRILL-7522) JSON reader (v1) omits null columns in SELECT *

2020-01-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7522:
--

 Summary: JSON reader (v1) omits null columns in SELECT *
 Key: DRILL-7522
 URL: https://issues.apache.org/jira/browse/DRILL-7522
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


Run the following unit test: {{TestStarQueries.testSelStarOrderBy}}, which runs 
the following query:

{code:sql}
select * from cp.`employee.json` order by last_name
{code}

The query reads the Foodmart file {{employee.json}}, which has records like this:

{code:json}
{"employee_id":53,...","end_date":null,"salary":...}
{code}

The field {{end_date}} turns out to be null for all records in 
{{employee.json}}.

Then, look at the verification query. It carefully includes all fields *except* 
{{end_date}}. That is, the test was written to expect that the JSON reader will 
omit a column that has all NULL values.

While it might seem OK to omit all-NULL columns (they don't have any data), the 
problem is that Drill is a distributed system. Suppose we query a directory of 
50 such files, some of which have all-NULLs in one field, some of which have 
all-NULLs in another. Although the files have the same schema, {{SELECT *}} 
will return different schemas (depending on which file has which non-NULL 
columns.)

A downstream operator will have to merge these schemas. And, since Drill fills 
in a Nullable INT field for missing columns, we might end up with a schema 
change exception because the actual field type is VARCHAR when it appears.

One can argue that {{SELECT *}} means "return all columns", not "return all 
columns except those that happen to be null in the first batch." Yes, we have 
the problem of not knowing the actual field type. Eventually, provided schemas 
will resolve such issues.

Note that in the "V2" JSON reader, {{end_date}} is included in the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7510) Incorrect String/number comparison with union types

2020-01-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7510:
--

 Summary: Incorrect String/number comparison with union types
 Key: DRILL-7510
 URL: https://issues.apache.org/jira/browse/DRILL-7510
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Run the following test: {{TestTopNSchemaChanges.testUnionTypes()}}. It will 
pass. Look at the expected output:

{code:java}
builder.baselineValues(0l, 0l);
builder.baselineValues(1.0d, 1.0d);
builder.baselineValues(3l, 3l);
builder.baselineValues(4.0d, 4.0d);
builder.baselineValues(6l, 6l);
builder.baselineValues(7.0d, 7.0d);
builder.baselineValues(9l, 9l);
builder.baselineValues("2", "2");
{code}

The string values sort after the numbers.

After the fix for DRILL-7502, we get the following output:

{code:java}
builder.baselineValues(0l, 0l);
builder.baselineValues(1.0d, 1.0d);
builder.baselineValues("2", "2");
builder.baselineValues(3l, 3l);
builder.baselineValues(4.0d, 4.0d);
builder.baselineValues("5", "5");
builder.baselineValues(6l, 6l);
builder.baselineValues(7.0d, 7.0d);
{code}

This accidental fix suggests that the original design was to convert values to 
the same type, then compare them. Converting numbers to strings, say, would 
cause them to be lexically ordered, as in the second output.

The {{UNION}} type is poorly supported, so it is likely that this bug does not 
affect actual users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7507) Convert fragment interrupts to exceptions

2020-01-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7507:
--

 Summary: Convert fragment interrupts to exceptions
 Key: DRILL-7507
 URL: https://issues.apache.org/jira/browse/DRILL-7507
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Operators periodically check whether they should continue by calling the 
{{shouldContinue()}} method. If the method returns false, operators return a 
{{STOP}} status in some form.

This change modifies the handling to throw an exception instead, cancelling a 
fragment in the same way that we handle errors.
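
For illustration, the pattern changes from a status check to a single call that 
throws; the names here are sketches, not the final code:

{code:java}
// Sketch only; the exception and method names are illustrative.
class QueryCancelledException extends RuntimeException { }

abstract class ExecutorStateSketch {
  abstract boolean shouldContinue();

  // Before: operators test shouldContinue() and propagate a STOP status.
  // After: one call that throws, so cancellation unwinds like any error.
  void checkContinue() {
    if (!shouldContinue()) {
      throw new QueryCancelledException();
    }
  }
}
{code}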



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7506) Simplify code gen error handling

2020-01-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7506:
--

 Summary: Simplify code gen error handling
 Key: DRILL-7506
 URL: https://issues.apache.org/jira/browse/DRILL-7506
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Code generation can generate a variety of errors. Most operators bubble these 
exceptions up several layers in the code before catching them. This patch moves 
error handling closer to the code gen itself to allow a) simpler code, and b) 
clearer error messages.
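
The shape of the change, as an illustrative sketch (the names are placeholders, 
not Drill APIs): catch the failure at the generation site and attach context 
there, rather than several layers up.

{code:java}
import java.util.concurrent.Callable;

// Placeholder names only; the point is where the catch lives.
class CodeGenHelper {
  static <T> T compileOrFail(Callable<T> codeGen, String what) {
    try {
      return codeGen.call();
    } catch (Exception e) {
      // One wrap, close to the source, with a message naming the operation.
      throw new IllegalStateException("Code generation failed for " + what, e);
    }
  }
}
{code}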



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7503) Refactor project operator

2019-12-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7503:
--

 Summary: Refactor project operator
 Key: DRILL-7503
 URL: https://issues.apache.org/jira/browse/DRILL-7503
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Work on another ticket revealed that the Project operator ("record batch") has 
grown quite complex. The setup phase lives in the operator as one huge 
function. The function combines the "logical" tasks of working out the 
projection expressions and types, the code gen for those expressions, and the 
physical setup of vectors.

The refactoring breaks up the logic so that it is easier to focus on the 
specific bits of interest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION

2019-12-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7502:
--

 Summary: Incorrect/invalid codegen for typeof() with UNION
 Key: DRILL-7502
 URL: https://issues.apache.org/jira/browse/DRILL-7502
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


The {{typeof()}} function is defined as follows:

{code:java}
  @FunctionTemplate(names = {"typeOf"},
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = NullHandling.INTERNAL)
  public static class GetType implements DrillSimpleFunc {

@Param
FieldReader input;
@Output
VarCharHolder out;
@Inject
DrillBuf buf;

@Override
public void setup() {}

@Override
public void eval() {
  String typeName = input.getTypeString();
  byte[] type = typeName.getBytes();
  buf = buf.reallocIfNeeded(type.length);
  buf.setBytes(0, type);
  out.buffer = buf;
  out.start = 0;
  out.end = type.length;
}
  }
{code}

Note that the {{input}} field is defined as {{FieldReader}} which has a method 
called {{getTypeString()}}. As a result, the code works fine in all existing 
tests in {{TestTypeFns}}.

I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
When I did, the query failed with a compile error in generated code:

{noformat}
SYSTEM ERROR: CompileException: Line 42, Column 43: 
  A method named "getTypeString" is not declared in any enclosing class nor any 
supertype, nor through a static import
{noformat}

The stack trace shows the generated code. Note that the type of {{input}} 
changes from a reader to a holder, causing the code to be invalid:

{code:java}
public class ProjectorGen0 {

DrillBuf work0;
UnionVector vv1;
VarCharVector vv6;
DrillBuf work9;
VarCharVector vv11;
DrillBuf work14;
VarCharVector vv16;

public void doEval(int inIndex, int outIndex)
throws SchemaChangeException
{
{
UnionHolder out4 = new UnionHolder();
{
out4 .isSet = vv1 .getAccessor().isSet((inIndex));
if (out4 .isSet == 1) {
vv1 .getAccessor().get((inIndex), out4);
}
}
// start of eval portion of typeOf function. //
VarCharHolder out5 = new VarCharHolder();
{
final VarCharHolder out = new VarCharHolder();
UnionHolder input = out4;
DrillBuf buf = work0;
UnionFunctions$GetType_eval:
{
String typeName = input.getTypeString();
byte[] type = typeName.getBytes();

buf = buf.reallocIfNeeded(type.length);
buf.setBytes(0, type);
out.buffer = buf;
out.start = 0;
out.end = type.length;
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7501.

Resolution: Won't Fix

As explained on the dev list, the return value in this case was changed to 
match the preferred name {{STRUCT}} for what Drill has historically called a 
{{MAP}}. The name {{STRUCT}} is consistent with Hive.

> Drill 1.17 sqlTypeOf for a Map now reports STRUCT
> -
>
> Key: DRILL-7501
> URL: https://issues.apache.org/jira/browse/DRILL-7501
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of 
> the {{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL 
> type for a column, using the type name that Drill uses.
> A query from page 163 of _Learning Apache Drill_:
> {code:sql}
> SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`;
> {code}
> Drill 1.14 results (correct):
> {noformat}
> ++
> | name_type  |
> ++
> | MAP|
> ++
> {noformat}
> Drill 1.17 results (incorrect):
> {noformat}
> +---+
> | name_type |
> +---+
> | STRUCT|
> +---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-5189) There's no documentation for the typeof() function

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-5189.

Resolution: Duplicate

> There's no documentation for the typeof() function
> --
>
> Key: DRILL-5189
> URL: https://issues.apache.org/jira/browse/DRILL-5189
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Chris Westin
>Assignee: Bridget Bevens
>Priority: Major
>
> I looked through the documentation at https://drill.apache.org/docs/ under 
> SQL Reference > SQL Functions > ... and could not find any reference to 
> typeof(). Google searches only turned up a reference to DRILL-4204.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-6362) typeof() lies about types

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-6362.

Resolution: Won't Fix

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But, I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated.) Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7501:
--

 Summary: Drill 1.17 sqlTypeOf for a Map now reports STRUCT
 Key: DRILL-7501
 URL: https://issues.apache.org/jira/browse/DRILL-7501
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of the 
{{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL type 
for a column, using the type name that Drill uses.

A query from page 163 of _Learning Apache Drill_:

{code:sql}
SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`;
{code}

Drill 1.14 results (correct):

{noformat}
++
| name_type  |
++
| MAP|
++
{noformat}

Drill 1.17 results (incorrect):

{noformat}
+---+
| name_type |
+---+
| STRUCT|
+---+
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7500) CTAS to JSON omits the final newline

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7500:
--

 Summary: CTAS to JSON omits the final newline
 Key: DRILL-7500
 URL: https://issues.apache.org/jira/browse/DRILL-7500
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Try the query from page 160 of _Learning Apache Drill_:

{code:sql}
ALTER SESSION SET `store.format` = 'json';
CREATE TABLE `out/json-null` AS SELECT * FROM `json/null2.json`;
{code}

Then, {{cat}} the resulting file:

{noformat}
cat out/json-null/0_0_0.json 
{
  "custId" : 123,
  "name" : "Fred",
  "balance" : 123.45
} {
  "custId" : 125,
  "name" : "Barney"
}(base) paul@paul-linux:~/eclipse-workspace/drillbook/data$
{noformat}

Notice that the file is missing a final newline, so the shell prompt is 
appended to the last closing brace.

Expected the file to end with a newline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7499:
--

 Summary: sqltypeof() function with an array returns "ARRAY", not 
type
 Key: DRILL-7499
 URL: https://issues.apache.org/jira/browse/DRILL-7499
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
   modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type of arrays.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7498) Allow the storage plugin editor window to be resizable

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7498:
--

 Summary: Allow the storage plugin editor window to be resizable
 Key: DRILL-7498
 URL: https://issues.apache.org/jira/browse/DRILL-7498
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Open the Drill Web Console. Click on the Storage tab. Pick a Storage Plugin and 
click Update.

The JSON appears in a nicely formatted editor. On a typical-sized monitor, the 
edit box takes up only half the screen vertically. Since it really helps to see 
more of the JSON than this small window shows, it would be handy if the edit 
box offered a resizer, as this very Jira edit box does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status

2019-12-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7487:
--

 Summary: Retire unused OUT_OF_MEMORY iterator status
 Key: DRILL-7487
 URL: https://issues.apache.org/jira/browse/DRILL-7487
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about.)

If an operator did choose to try to use this status, there are two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to `next()`. The complexity of the state machine needed to do this is very high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{code:java}
try {
  // allocate something
} catch (OutOfMemoryException e) {
  context.requestMemory(this);
  // allocate something again, throwing OOM if it fails again
}
{code}

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above (a 
sketch follows).
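
A sketch of the proposed stub, assuming it lands on the operator context (the 
interface and signature here are illustrative):

{code:java}
// Illustrative addition to the operator context. A no-op today; later it
// could ask the fragment executor to have sibling operators release memory.
public interface OperatorMemoryRequests {

  /**
   * Called by an operator that has hit memory pressure. Implementations
   * may ask the other operators in this fragment to release memory.
   *
   * @param requestor the operator asking for memory, excluded from the
   *                  release request
   */
  default void requestMemory(Object requestor) {
    // Stub: do nothing for now.
  }
}
{code}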




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-5272) Text file reader is inefficient

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-5272.

Resolution: Fixed

This issue was fixed when converting the text readers to use the result set 
loader framework.

> Text file reader is inefficient
> ---
>
> Key: DRILL-5272
> URL: https://issues.apache.org/jira/browse/DRILL-5272
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> From inspection of the ScanBatch and CompliantTextReader.
> Every batch holds about five implicit vectors. These are repeated for every 
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16 
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - 
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above, or a different issues is causing memory growth in 
> the scan batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query. 
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long 
> file name and path for every record. Since the values are common to every 
> record, perhaps we can use the same data copy for each, but have the offset 
> vector for each record just point to the single copy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7486) Restructure row set reader builder

2019-12-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7486:
--

 Summary: Restructure row set reader builder
 Key: DRILL-7486
 URL: https://issues.apache.org/jira/browse/DRILL-7486
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The code to build a row set reader is located in several places, and is tied to 
the {{RowSet}} class for historical reasons. This restructuring pulls out the 
code so it can be used from a {{VectorContainer}} or other source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7480) Revisit parameterized type design for Metadata API

2019-12-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7480:
--

 Summary: Revisit parameterized type design for Metadata API
 Key: DRILL-7480
 URL: https://issues.apache.org/jira/browse/DRILL-7480
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Grabbed latest master and found that the code will not build in Eclipse due to 
a type mismatch in the statistics code. Specifically, the problem is that we 
have several parameterized classes, but we often omit the parameters. 
Evidently, doing so is fine for some compilers, but is an error in Eclipse.

Then, while fixing the immediate issue, I found an opposite problem: code that 
would satisfy Eclipse, but which failed in the Maven build.

I spent time making another pass through the metadata code to add type 
parameters, remove "rawtypes" ignores and so on. See DRILL-7479.

Stepping back a bit, it seems that we are perhaps using the type parameters in 
a way that does not serve our needs in this particular case.

We have many classes that hold onto particular values of some type, such as 
{{StatisticsHolder}}, which can hold a String, a Double, etc. So, we 
parameterize.

But, after that, we treat the items generically. We don't care that {{foo}} is 
a {{StatisticsHolder<Double>}} and {{bar}} is a {{StatisticsHolder<String>}}; we 
just want to create, combine and work with lists of statistics.

The same is true in several other places such as column type, comparator type, 
etc. For comparators, we don't really care what type they compare; we just 
want, given two generic {{StatisticsHolder}}s, to get the corresponding 
comparator.

This is very similar to the situation with the "column accessors" in EVF: each 
column is a {{VARCHAR}} or a {{FLOAT8}}, but most code just treats them 
generically. So, the type-ness of the value is treated as a runtime attribute, 
not a compile-time attribute.

This is a subtle point. Most code in Drill does not work with types directly in 
Java code. Instead, Drill is an interpreter: it works with generic objects 
which, at run time, resolve to actual typed objects. It is the difference 
between writing an application (directly uses types) and writing a language 
(generically works with all types.)

For example, a {{StatisticsHolder}} probably only needs to be type-aware at the 
moment it is populated or used, but not in all the generic column-level and 
table-level code. (The same is true of properties in the column metadata class, 
as an example.)

IMHO, {{StatisticsHolder}} probably wants to be a non-parameterized class. It 
should have a declaration object that, say, provides the name, type, comparator 
and other metadata. When the actual value is needed, a typed getter can be 
provided:
{code:java}
<T> T getValue();
{code}
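
A sketch of that shape; everything beyond the idea of a descriptor plus a typed 
getter is illustrative:

{code:java}
import java.util.Comparator;

// Sketch: the descriptor carries the name, type and comparator.
class StatisticsKind {
  final String name;
  final Class<?> valueType;
  final Comparator<Object> comparator;

  StatisticsKind(String name, Class<?> valueType, Comparator<Object> comparator) {
    this.name = name;
    this.valueType = valueType;
    this.comparator = comparator;
  }
}

// Sketch: the holder itself is not parameterized; type-ness is a runtime
// attribute, and typed access happens only at the point of use.
class StatisticsHolderSketch {
  private final StatisticsKind kind;
  private final Object value;

  StatisticsHolderSketch(StatisticsKind kind, Object value) {
    this.kind = kind;
    this.value = value;
  }

  @SuppressWarnings("unchecked")
  <T> T getValue() { return (T) value; }

  // Generic code compares holders without caring about the value type.
  int compareTo(StatisticsHolderSketch other) {
    return kind.comparator.compare(value, other.value);
  }
}
{code}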
As it is, the type system is very complex but we get no value. Since it is so 
complex, the code just punted and sprinkled raw types and ignores in many 
places, which defeats the purpose of parameterized types anyway.

Suggestion: let's revisit this work after the upcoming release and see if we 
can simplify it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7479) Short-term fixes for metadata API parameterized type issues

2019-12-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7479:
--

 Summary: Short-term fixes for metadata API parameterized type 
issues
 Key: DRILL-7479
 URL: https://issues.apache.org/jira/browse/DRILL-7479
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


See DRILL- for a discussion of the issues with how we currently use 
parameterized types in the metadata API.

This ticket is for short-term fixes that convert unsafe raw types of the 
form {{StatisticsHolder}} to the parameterized form {{StatisticsHolder<?>}} so 
that the compiler does not complain with many warnings (and a few Eclipse-only 
errors.)

The topic should be revisited later in the context of DRILL-.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7303) Filter record batch does not handle zero-length batches

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7303.

Resolution: Duplicate

> Filter record batch does not handle zero-length batches
> ---
>
> Key: DRILL-7303
> URL: https://issues.apache.org/jira/browse/DRILL-7303
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Testing of the row-set-based JSON reader revealed a limitation of the Filter 
> record batch: if an incoming batch has zero records, the length of the 
> associated SV2 is left at -1. In particular:
> {code:java}
> public class SelectionVector2 implements AutoCloseable {
>   // Indicates actual number of rows in the RecordBatch
>   // container which owns this SV2 instance
>   private int batchActualRecordCount = -1;
> {code}
> Then:
> {code:java}
> public abstract class FilterTemplate2 implements Filterer {
>   @Override
>   public void filterBatch(int recordCount) throws SchemaChangeException{
> if (recordCount == 0) {
>   outgoingSelectionVector.setRecordCount(0);
>   return;
> }
> {code}
> Notice there is no call to set the actual record count. The solution is to 
> insert one line of code:
> {code:java}
> if (recordCount == 0) {
>   outgoingSelectionVector.setRecordCount(0);
>   outgoingSelectionVector.setBatchActualRecordCount(0); // <-- Add this
>   return;
> }
> {code}
> Without this, the query fails with an error due to an invalid index of -1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7311) Partial fixes for empty batch bugs

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7311.

Resolution: Duplicate

> Partial fixes for empty batch bugs
> --
>
> Key: DRILL-7311
> URL: https://issues.apache.org/jira/browse/DRILL-7311
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.18.0
>
>
> DRILL-7305 explains that multiple operators have serious bugs when presented 
> with empty batches. DRILL-7306 explains that the EVF (AKA "new scan 
> framework") was originally coded to emit an empty "fast schema" batch, but 
> that the feature was disabled because of the many empty-batch operator 
> failures.
> This ticket covers a set of partial fixes for empty-batch issues. This is the 
> result of work done to get the converted JSON reader to work with a "fast 
> schema." The JSON work, in the end, revealed that Drill has too many bugs to 
> enable fast schema, and so the DRILL-7306 was implemented instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7305) Multiple operators do not handle empty batches

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7305.

Resolution: Duplicate

> Multiple operators do not handle empty batches
> --
>
> Key: DRILL-7305
> URL: https://issues.apache.org/jira/browse/DRILL-7305
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Priority: Major
>
> While testing the new "EVF" framework, it was found that multiple operators 
> incorrectly handle empty batches. The EVF framework is set up to return a 
> "fast schema" empty batch with only schema as its first batch. It turns out 
> that many operators fail with problems such as:
> * Failure to set the value counts in the output container
> * Fail to initialize the offset vector position 0 to 0 for variable-width or 
> repeated vectors
> And so on.
> Partial fixes are in the JSON reader PR.
> For now, the easiest work-around is to disable the "fast schema" path in the 
> EVF: DRILL-7306.
> To discover the remaining issues, enable the 
> {{ScanOrchestratorBuilder.enableSchemaBatch}} option and run unit tests. You 
> can use the {{VectorChecker}} and {{VectorAccessorUtilities.verify()}} 
> methods to check state. Insert a call to {{verify()}} in each "next" method: 
> verify the incoming and outgoing batches. The checker only verifies a few 
> vector types; but these are enough to show many problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7458) Base storage plugin framework

2019-11-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7458:
--

 Summary: Base storage plugin framework
 Key: DRILL-7458
 URL: https://issues.apache.org/jira/browse/DRILL-7458
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The "Easy" framework allows third-parties to add format plugins to Drill with 
moderate effort. (The process could be easier, but "Easy" makes it as simple as 
possible given the current structure.)

At present, no such "starter" framework exists for storage plugins. Further, 
multiple storage plugins have implemented filter push down, seemingly by 
copying large blocks of code.

This ticket offers a "base" framework for storage plugins and for filter 
push-downs. The framework builds on the EVF, allowing plugins to also support 
project push down.

The framework has a "test mule" storage plugin to verify functionality, and was 
used as the basis of a REST-like plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7457) Join assignment is random when table costs are identical

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7457:
--

 Summary: Join assignment is random when table costs are identical
 Key: DRILL-7457
 URL: https://issues.apache.org/jira/browse/DRILL-7457
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Create a simple test: a join between two identical scans, call them t1 and t2. 
Ensure that the scans report the same cost. Capture the logical plan. Repeat 
the exercise several times. You will see that Drill randomly assigns t1 to the 
left side or right side.

Operationally this might not make a difference. But, in tests, it means that 
trying to compare an "actual" and "golden" plan is impossible as the plans are 
unstable.

Also, if only the estimates are the same, but the table size differs, then 
runtime performance will randomly be better on some query runs than others.

Better would be to fall back to the SQL statement's table order if the two 
tables are otherwise identical in cost.

This may be a Calcite issue rather than a Drill issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7456) Batch count fixes for 12 additional operators

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7456:
--

 Summary: Batch count fixes for 12 additional operators
 Key: DRILL-7456
 URL: https://issues.apache.org/jira/browse/DRILL-7456
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enables batch validation for 12 additional operators:

* MergingRecordBatch
* OrderedPartitionRecordBatch
* RangePartitionRecordBatch
* TraceRecordBatch
* UnionAllRecordBatch
* UnorderedReceiverBatch
* UnpivotMapsRecordBatch
* WindowFrameRecordBatch
* TopNBatch
* HashJoinBatch
* ExternalSortBatch
* WriterRecordBatch

Fixes issues found with those checks so that this set of operators passes all 
checks.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

