[GitHub] [drill] cgivre commented on issue #1892: DRILL-7437: Storage Plugin for Generic HTTP REST API

2020-04-02 Thread GitBox
cgivre commented on issue #1892: DRILL-7437: Storage Plugin for Generic HTTP 
REST API
URL: https://github.com/apache/drill/pull/1892#issuecomment-608197878
 
 
   @paul-rogers 
   I made some minor code cleanup and changes to the docs.  Commits are 
squashed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on issue #1892: DRILL-7437: Storage Plugin for Generic HTTP REST API

2020-04-02 Thread GitBox
paul-rogers commented on issue #1892: DRILL-7437: Storage Plugin for Generic 
HTTP REST API
URL: https://github.com/apache/drill/pull/1892#issuecomment-608171729
 
 
   Added support to configure the HTTP proxy from environment variables or the 
`drill-override.conf` file. Added tests for this case. The HTTP storage plugin 
proxy config is applied "on top" of these settings. Updated the `README.md` file.
   
   During this work I did a full review of the code (and made any changes that 
were needed.)
   
   Suggestion: this is getting pretty far along. @cgivre, I suggest you take a 
final pass to ensure all is to your liking. Then, squash commits. After that, 
perhaps @arina-ielchiieva can give a final review and the approval. (Since I'm 
a co-author on this one, I don't want to approve my own changes.)




[GitHub] [drill] paul-rogers opened a new pull request #2047: DRILL-7675: Work around for partitions sender memory use

2020-04-02 Thread GitBox
paul-rogers opened a new pull request #2047: DRILL-7675: Work around for 
partitions sender memory use
URL: https://github.com/apache/drill/pull/2047
 
 
   
   # [DRILL-7675](https://issues.apache.org/jira/browse/DRILL-7675): Work 
around for partitions sender memory use
   
   ## Description
   
   DRILL-7675 describes a combination of factors which exposed a flaw in the 
partition sender:
   
   * The partition sender holds one buffer for each of the receivers, resulting 
in n^2 buffers total in the system; all on a single machine for a one-node 
Drill.
   * Every buffer holds 1024 rows.
   * The size of each row depends on the row shape. In DRILL-7675, one table 
has 250+ columns, some nested within repeated maps. Since each needs a vector 
of 1024 values (or 5 * 1024 or even 5 * 5 * 1024), the total memory size is 
large.
   
   The result is that Drill attempts to allocate many GB of buffers. But, the 
actual data set is only 2 MB in size.
   
   DRILL-7686 describes the needed longer-term redesign. This PR includes a 
workaround: the ability to reduce the number of rows per send buffer as the 
receiver count increases. See Documentation below.
   
   By enabling the new option, the query will now run in the configuration that 
the user describes in DRILL-7675. The cost, however, is slower performance, 
which is exactly what the user was trying to prevent by enabling excessive 
parallelism. The best workaround in this case (at least with local files) is to 
go with default parallelism.
   
   Also includes a number of cleanup and diagnostic fixes found during the 
investigation.
   
   ## Documentation
   
   Adds a new system/session option to allow the buffer size to shrink linearly 
with the increase in slice count, over some limit: 
`exec.partition.mem_throttle`:
   
   * The default is 0, which leaves the current logic unchanged.
   * If set to a positive value, then when the slice count exceeds that amount, 
the buffer size per sender is reduced.
   * The reduction factor is 1 / (slice count - threshold), with a minimum 
batch size of 256 records.
   
   So, if we set the threshold at 2, and run 10 slices, each slice would get 
1024 / 8 = 128 records, which the minimum raises to 256 records.
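
As a rough sketch of the arithmetic above (not the actual Drill code), the reduction can be written as follows. The class and method names are hypothetical, and the sketch assumes integer division, a 1024-row default and a 256-row floor as stated in the description.

```java
/** Hypothetical sketch of the throttle arithmetic; not the actual Drill code. */
final class SendBufferThrottleSketch {
  static final int DEFAULT_ROWS = 1024; // rows per send buffer, per the description
  static final int MIN_ROWS = 256;      // stated minimum batch size

  /**
   * @param sliceCount number of receiving slices
   * @param threshold  value of exec.partition.mem_throttle (0 leaves the logic unchanged)
   * @return rows per send buffer after applying the throttle
   */
  static int rowsPerBuffer(int sliceCount, int threshold) {
    if (threshold <= 0 || sliceCount <= threshold) {
      return DEFAULT_ROWS; // option off, or slice count does not exceed the threshold
    }
    int reduced = DEFAULT_ROWS / (sliceCount - threshold);
    return Math.max(reduced, MIN_ROWS); // never go below the minimum batch size
  }

  public static void main(String[] args) {
    System.out.println(rowsPerBuffer(10, 2)); // 1024 / 8 = 128, raised to 256
    System.out.println(rowsPerBuffer(4, 2));  // 1024 / 2 = 512
    System.out.println(rowsPerBuffer(10, 0)); // option off: 1024
  }
}
```

Under these assumptions, the threshold-2, 10-slice case lands on the 256-row floor, which matches the log message shown under Documentation.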
   
   This option controls memory, but at the obvious cost of increasing overhead. One 
could argue that this is a good thing. As the number of senders increases, the 
number of records going to each sender decreases, which increases the time that 
batches must accumulate before they are sent.
   
   If the option is enabled, and buffer size reduction kicks in, you'll find an 
info-level log message which details the reduction:
   
   ```
   exec.partition.mem_throttle is set to 2: 10 receivers, reduced send buffer 
size from 1024 to 256 rows
   ```
   
   ## Testing
   
   Created an ad-hoc test using the query from DRILL-7675. Ran this test with a 
variety of options, including with the new option enabled and disabled. See 
DRILL-7675 for a full description of the analysis.
   
   Ran the query from DRILL-7675 in the Drill server using the Web console with 
the new option on and off (along with other variations.) Verified that, with 
the option off, performance is the same before and after the change. (3 sec on 
my machine.) Verified that, with the option on, the query completes even with 
excessive parallelism (though the query does run slower in that case.)
   




[jira] [Created] (DRILL-7690) Display (major) operators in fragment title bar in Web UI

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7690:
--

 Summary: Display (major) operators in fragment title bar in Web UI
 Key: DRILL-7690
 URL: https://issues.apache.org/jira/browse/DRILL-7690
 Project: Apache Drill
  Issue Type: Improvement
  Components: Web Server
Affects Versions: 1.17.0
Reporter: Paul Rogers


Run a query in the Drill Web Console. View the profile, Query tab. Scroll down 
to the list of fragments. You'll see a gray bar with a title such as

Major Fragment: 02-xx-xx

This section shows the timing of the fragments.

But, what is happening in this fragment? To find out we must scroll way down to 
the lower section where we see:


02-xx-00 - SINGLE_SENDER
02-xx-01 - SELECTION_VECTOR_REMOVER
02-xx-02 - LIMIT
02-xx-03 - SELECTION_VECTOR_REMOVER
02-xx-04 - TOP_N_SORT
02-xx-05 - UNORDERED_RECEIVER

The result is quite a bit of scroll down/scroll up.

This ticket asks to show the major operators in the fragment title. For 
example, for the above:

Major Fragment: 02-xx-xx (TOP_N_SORT, LIMIT)

The "minor" operators which are omitted (because they are not the focus of the 
fragment) include senders, receivers and the SVR.

Note that the operators should appear in data flow order (bottom to top).
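
A minimal sketch of how such a title could be built, assuming the operators arrive in the order the profile lists them (top to bottom). The class name, method names and the rule used to detect "minor" operators (names ending in _SENDER or _RECEIVER, plus SELECTION_VECTOR_REMOVER) are illustrative assumptions, not existing Drill code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the requested title; not existing Drill code. */
final class FragmentTitleSketch {

  // Assumption: "minor" operators are senders, receivers and the SVR.
  static boolean isMinor(String operator) {
    return operator.endsWith("_SENDER")
        || operator.endsWith("_RECEIVER")
        || operator.equals("SELECTION_VECTOR_REMOVER");
  }

  /** Builds the title from operators listed top to bottom, as in the profile. */
  static String title(String fragmentId, List<String> operatorsTopToBottom) {
    List<String> major = new ArrayList<>();
    for (String operator : operatorsTopToBottom) {
      if (!isMinor(operator)) {
        major.add(operator);
      }
    }
    Collections.reverse(major); // data flow order is bottom to top
    if (major.isEmpty()) {
      return "Major Fragment: " + fragmentId;
    }
    return "Major Fragment: " + fragmentId + " (" + String.join(", ", major) + ")";
  }

  public static void main(String[] args) {
    List<String> operators = Arrays.asList(
        "SINGLE_SENDER", "SELECTION_VECTOR_REMOVER", "LIMIT",
        "SELECTION_VECTOR_REMOVER", "TOP_N_SORT", "UNORDERED_RECEIVER");
    // Prints: Major Fragment: 02-xx-xx (TOP_N_SORT, LIMIT)
    System.out.println(title("02-xx-xx", operators));
  }
}
```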



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7689) Do not save profiles for trivial queries

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7689:
--

 Summary: Do not save profiles for trivial queries
 Key: DRILL-7689
 URL: https://issues.apache.org/jira/browse/DRILL-7689
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


Drill saves a query profile for every query. Some queries are trivial; there is 
no useful information (for the user) in such queries. Examples include {{ALTER 
SESSION/SYSTEM}}, {{CREATE SCHEMA}}, and other internal commands.

Logic already exists to omit profiles for {{ALTER}} commands, but only if a 
session option is set. No ability exists to omit profiles for the other 
statements.

This ticket asks to:
 * Omit profiles for trivial commands by default. (Part of the task is to 
define the set of trivial commands.)
 * Provide an option to enable such profiles, primarily for use by developers 
when debugging the trivial commands.
 * If no profile is available, show a message to that effect in the Web UI 
where we currently display the profile number. Provide a link to the 
documentation page that explains why there is no profile (and how to use the 
above option to request a profile if needed.)





[jira] [Created] (DRILL-7688) Provide web console option to see non-default options

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7688:
--

 Summary: Provide web console option to see non-default options
 Key: DRILL-7688
 URL: https://issues.apache.org/jira/browse/DRILL-7688
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


The Drill web console has evolved to become quite powerful. The Options page 
has many wonderful improvements over earlier versions. The "Default" button is 
a handy way to see which options have been set, and to reset options to their 
default values.

When testing and troubleshooting, it is helpful to identify those options which 
are not at their default values. Please add a filter at the top of the page for 
"non-default" in addition to the existing topic-based filters.

It may also be useful to add a bit more color to the "Default" button when an 
option is set. At present, the distinction is gray vs. black text, which is 
better than it was. It would be better still to have more contrast so 
non-default values are easier to see.





[jira] [Created] (DRILL-7687) Inaccurate memory estimates in hash join

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7687:
--

 Summary: Inaccurate memory estimates in hash join
 Key: DRILL-7687
 URL: https://issues.apache.org/jira/browse/DRILL-7687
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.15.0
Reporter: Paul Rogers


See DRILL-7675. In that ticket, we tried to reproduce an OOM case in the 
partition sender. In so doing, we mucked with various parallelization options. 
The query has 2 MB of data, but at one point the query would fail to run 
because the hash join could not obtain enough memory (on a system with 8 GB of 
memory available.)

The problem is that the memory calculator sees a worst-case scenario: a row 
with 250+ columns. The hash join estimated it needed something like 650MB of 
memory to perform the join. (That is 650 MB per fragment, and there were 
multiple fragments.) Since there was insufficient memory, and the 
{{drill.exec.hashjoin.fallback.enabled}} option was disabled, the hash join 
failed before it even started.

Better would be to at least try the query. In this case, with 2MB of data, the 
query succeeds. (Had to enable the magic option to do so.)

Better also would be to use the estimated row counts when estimating memory 
use. Maybe better estimates for the amount of memory needed per row. (The data 
in question has multiple nested map arrays, causing cardinality estimates to 
grow by 5x at each level.)

Perhaps use the "batch sizing" mechanism to detect actual memory use by 
analyzing the incoming batch.

There is no obvious answer. However, the goal is clear: the query should 
succeed if the actual memory needed fits within that available; we should not 
fail proactively based on estimates of needed memory. (This is what the 
{{drill.exec.hashjoin.fallback.enabled}} option does; perhaps it should be on 
by default.)





[jira] [Created] (DRILL-7686) Excessive memory use in partition sender

2020-04-02 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7686:
--

 Summary: Excessive memory use in partition sender
 Key: DRILL-7686
 URL: https://issues.apache.org/jira/browse/DRILL-7686
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.14.0
Reporter: Paul Rogers


The Partition Sender in Drill is responsible for taking a batch from fragment x, 
and sending its rows to all other fragments f1, f2, ... fn. For example, when 
joining, fragment x might read from a portion of a file, hash the join key, and 
partition rows by hash key to the receiving fragments that join rows with that 
same key.

Since Drill is columnar, the sender needs to send a batch of columns to each 
receiver. To be efficient, that batch should contain a reasonable number of 
rows. The current default is 1024.

Drill creates buffers, one per receiver, to gather the rows. Thus, each sender 
needs n buffers: one for each receiver.

Because Drill is symmetrical, there are n senders (scans). Since each maintains 
n send buffers, we have a total of n^2 buffers. That is, the amount of memory 
used by the partition sender grows with the square of the degree of parallelism 
for a query.

In addition, as seen in DRILL-7675, the size of the buffers is controlled not 
by Drill, but by the incoming data. The query in DRILL-7675 had a row with 260+ 
fields, some of which were map arrays.

The result is that the query, which processes 2 MB of data, runs out of memory 
when many GB are available. Drill is simply doing the math: n^2 buffers, each 
with 1024 rows, each with 250 fields, many with a cardinality of 5x (or 25x or 
125x, depending on array depth) of the row count. The result is a very large 
memory footprint.
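
To make the math concrete, here is a back-of-the-envelope sketch. The class and method names, the parallelism of 20, the 8 bytes per value and the uniform per-field cardinality are illustrative assumptions only; real vectors also carry offset and null-bit overhead, which this ignores.

```java
/** Hypothetical estimate of partition sender buffer memory; not Drill code. */
final class PartitionSenderMemorySketch {

  /**
   * Estimated bytes = n^2 buffers * rows per buffer * fields
   * * average values per field per row * bytes per value.
   */
  static long estimateBytes(int parallelism, int rowsPerBuffer, int fields,
                            int avgCardinality, int bytesPerValue) {
    long buffers = (long) parallelism * parallelism; // n senders, each with n buffers
    return buffers * rowsPerBuffer * fields * avgCardinality * bytesPerValue;
  }

  public static void main(String[] args) {
    // Assumed: 20-way parallelism, 1024 rows, 250 flat fields, 8 bytes per value.
    System.out.println(estimateBytes(20, 1024, 250, 1, 8)); // 819200000
    // Same, but assuming every field averages 5 values per row.
    System.out.println(estimateBytes(20, 1024, 250, 5, 8)); // 4096000000
  }
}
```

With these assumed inputs the estimate is roughly 0.8 GB for flat columns and roughly 4 GB at an average cardinality of 5, and doubling the parallelism quadruples either figure.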

There is no simple bug-fix solution: the design is inherently unbounded. This 
ticket asks to develop a new design. Some crude ideas:
 * Use a row-based format for sending to avoid columnar overhead.
 * Send rows as soon as they are available on the sender side; allow the 
receiver to do buffering.
 * If doing buffering, flush rows after x ms to avoid slowing the system. (The 
current approach waits for buffers to fill.)
 * Consolidate buffers on each sending node. (This is the Mux/DeMux approach 
which is in the code, but was never well understood, and has its own 
concurrency and memory ownership problems.)
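
The timed-flush idea in the list above could look roughly like the sketch below. Everything here is hypothetical (the class, its methods, and the use of strings as rows); the caller passes in the current time so the behavior is easy to follow, and nothing here reflects how Drill's sender is actually structured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical buffer that flushes when full or when its oldest row is too old. */
final class TimedFlushBufferSketch {
  private final int maxRows;
  private final long maxWaitMs;
  private final List<String> rows = new ArrayList<>();
  private long firstRowAtMs;

  TimedFlushBufferSketch(int maxRows, long maxWaitMs) {
    this.maxRows = maxRows;
    this.maxWaitMs = maxWaitMs;
  }

  /** Adds a row and returns the batch to send, or an empty list if none is due. */
  List<String> add(String row, long nowMs) {
    if (rows.isEmpty()) {
      firstRowAtMs = nowMs; // start the clock at the first buffered row
    }
    rows.add(row);
    return flushIfDue(nowMs);
  }

  /** Returns the buffered rows if the buffer is full or has waited long enough. */
  List<String> flushIfDue(long nowMs) {
    boolean full = rows.size() >= maxRows;
    boolean stale = !rows.isEmpty() && nowMs - firstRowAtMs >= maxWaitMs;
    if (!full && !stale) {
      return Collections.emptyList();
    }
    List<String> batch = new ArrayList<>(rows);
    rows.clear();
    return batch;
  }

  public static void main(String[] args) {
    TimedFlushBufferSketch buffer = new TimedFlushBufferSketch(3, 100);
    System.out.println(buffer.add("a", 0));      // []
    System.out.println(buffer.add("b", 10));     // []
    System.out.println(buffer.add("c", 20));     // [a, b, c] (buffer full)
    System.out.println(buffer.add("d", 30));     // []
    System.out.println(buffer.flushIfDue(130));  // [d] (waited 100 ms)
  }
}
```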





[GitHub] [drill] paul-rogers commented on a change in pull request #2044: DRILL-7678: Update Yauaa Dependency

2020-04-02 Thread GitBox
paul-rogers commented on a change in pull request #2044: DRILL-7678: Update 
Yauaa Dependency
URL: https://github.com/apache/drill/pull/2044#discussion_r402485830
 
 

 ##
 File path: 
contrib/udfs/src/main/java/org/apache/drill/exec/udfs/UserAgentAnalyzerProvider.java
 ##
 @@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.udfs;
+
+import nl.basjes.parse.useragent.UserAgentAnalyzer;
+
+public class UserAgentAnalyzerProvider {
+
+  public static UserAgentAnalyzer getInstance() {
+    return UserAgentAnalyzerHolder.INSTANCE;
+  }
+
+  private static class UserAgentAnalyzerHolder {
+    private static final UserAgentAnalyzer INSTANCE = UserAgentAnalyzer.newBuilder()
 
 Review comment:
   @arina-ielchiieva, thanks; that article is accurate in the case that class 
`Something` is loaded earlier than the first use of `getInstance()`. Maybe 
there are other static functions or constants that force earlier class loading.
   
   In this specific case, the only method is `getInstance()` and so the outer 
class won't be loaded until that time. This means that the inner and outer 
classes are loaded at essentially the same time: on that first call to 
`getInstance()`.
   
   We can commit this PR with the code as it is; it works fine. My concern, 
however, is that we use static instances all over, and we've kind of relied on 
the behavior I've been describing. To be consistent, we'd want to go and change 
all the other uses as well. But, doing so would be unnecessary work in cases, 
like this one, where a single layer works fine.
   
   Let's go ahead and commit this, then we can continue the discussion without 
slowing this PR. 
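
To illustrate the point about class loading, here is a small self-contained sketch (not the Yauaa or Drill code; all names are made up). It compares a single-layer static instance with the holder idiom and records when the expensive object is created. The extra `describe()` methods stand in for "other static functions" that would force earlier initialization of the outer class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical comparison of a single static layer vs. the holder idiom. */
final class HolderIdiomSketch {
  static final List<String> LOG = new ArrayList<>();

  static final class Expensive {
    Expensive(String label) {
      LOG.add("created " + label); // record when construction happens
    }
  }

  /** Single layer: the instance is built when this class is initialized. */
  static final class SingleLayer {
    private static final Expensive INSTANCE = new Expensive("single-layer");
    static Expensive getInstance() { return INSTANCE; }
    static String describe() { return "SingleLayer"; } // also triggers initialization
  }

  /** Holder idiom: the instance is built only when Holder is first touched. */
  static final class TwoLayer {
    static Expensive getInstance() { return Holder.INSTANCE; }
    static String describe() { return "TwoLayer"; } // does not touch Holder
    private static final class Holder {
      static final Expensive INSTANCE = new Expensive("holder");
    }
  }

  public static void main(String[] args) {
    TwoLayer.describe();
    System.out.println(LOG); // []
    SingleLayer.describe();
    System.out.println(LOG); // [created single-layer]
    TwoLayer.getInstance();
    System.out.println(LOG); // [created single-layer, created holder]
  }
}
```

If `getInstance()` is the only static member, as in this PR, both variants create the instance on that first call, so the extra layer makes no observable difference.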




How Satori is using Drill

2020-04-02 Thread Charles Givre
Hello all Drillers,
Ben Herzberg posted this article and I thought I'd forward this to the greater 
Drill community!  Thanks Ben for the write up!

https://satoricyber.com/how-satori-uses-apache-drill-to-conquer-data-exploration-preparation/
 




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2044: DRILL-7678: Update Yauaa Dependency

2020-04-02 Thread GitBox
arina-ielchiieva commented on a change in pull request #2044: DRILL-7678: 
Update Yauaa Dependency
URL: https://github.com/apache/drill/pull/2044#discussion_r402229769
 
 

 ##
 File path: 
contrib/udfs/src/main/java/org/apache/drill/exec/udfs/UserAgentAnalyzerProvider.java
 ##
 @@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.udfs;
+
+import nl.basjes.parse.useragent.UserAgentAnalyzer;
+
+public class UserAgentAnalyzerProvider {
+
+  public static UserAgentAnalyzer getInstance() {
+    return UserAgentAnalyzerHolder.INSTANCE;
+  }
+
+  private static class UserAgentAnalyzerHolder {
+    private static final UserAgentAnalyzer INSTANCE = UserAgentAnalyzer.newBuilder()
 
 Review comment:
   @paul-rogers here is some explanation that you might find useful: 
https://en.m.wikipedia.org/wiki/Initialization-on-demand_holder_idiom




[GitHub] [drill] nielsbasjes commented on a change in pull request #2044: DRILL-7678: Update Yauaa Dependency

2020-04-02 Thread GitBox
nielsbasjes commented on a change in pull request #2044: DRILL-7678: Update 
Yauaa Dependency
URL: https://github.com/apache/drill/pull/2044#discussion_r402180380
 
 

 ##
 File path: 
contrib/udfs/src/main/java/org/apache/drill/exec/udfs/UserAgentFunctions.java
 ##
 @@ -46,11 +46,14 @@
 DrillBuf outBuffer;
 
 @Workspace
-nl.basjes.parse.useragent.UserAgentAnalyzerDirect uaa;
+nl.basjes.parse.useragent.UserAgentAnalyzer uaa;
+
+@Workspace
+java.util.List allFields;
 
 public void setup() {
-  uaa = nl.basjes.parse.useragent.UserAgentAnalyzerDirect.newBuilder().dropTests().hideMatcherLoadStats().build();
-  uaa.getAllPossibleFieldNamesSorted();
+  uaa = org.apache.drill.exec.udfs.UserAgentAnalyzerProvider.getInstance();
+  allFields = uaa.getAllPossibleFieldNamesSorted();
 
 Review comment:
   The list of field names is essentially static.




[GitHub] [drill] nielsbasjes commented on a change in pull request #2044: DRILL-7678: Update Yauaa Dependency

2020-04-02 Thread GitBox
nielsbasjes commented on a change in pull request #2044: DRILL-7678: Update 
Yauaa Dependency
URL: https://github.com/apache/drill/pull/2044#discussion_r402179337
 
 

 ##
 File path: 
contrib/udfs/src/main/java/org/apache/drill/exec/udfs/UserAgentAnalyzerProvider.java
 ##
 @@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.udfs;
+
+import nl.basjes.parse.useragent.UserAgentAnalyzer;
+
+public class UserAgentAnalyzerProvider {
+
+  public static UserAgentAnalyzer getInstance() {
+    return UserAgentAnalyzerHolder.INSTANCE;
+  }
+
+  private static class UserAgentAnalyzerHolder {
+    private static final UserAgentAnalyzer INSTANCE = UserAgentAnalyzer.newBuilder()
+      .dropTests()
+      .hideMatcherLoadStats()
+      .immediateInitialization()
+      .build();
+  }
+}
 
 Review comment:
   Fixed.

