[GitHub] [drill] jnturton commented on pull request #1710: DRILL-7127: Updating hbase version for mapr profile

2022-01-17 Thread GitBox


jnturton commented on pull request #1710:
URL: https://github.com/apache/drill/pull/1710#issuecomment-1015148877


   @Agirish can we close or reassign this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [drill] paul-rogers commented on a change in pull request #2419: DRILL-8085: EVF V2 support in the "Easy" format plugin

2022-01-17 Thread GitBox


paul-rogers commented on a change in pull request #2419:
URL: https://github.com/apache/drill/pull/2419#discussion_r786323997



##
File path: contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogFormatPlugin.java
##
@@ -40,18 +40,16 @@
   private static class HttpLogReaderFactory extends FileReaderFactory {
 
 private final HttpdLogFormatConfig config;
-private final int maxRecords;
 private final EasySubScan scan;
 
-private HttpLogReaderFactory(HttpdLogFormatConfig config, int maxRecords, EasySubScan scan) {
+private HttpLogReaderFactory(HttpdLogFormatConfig config, EasySubScan scan) {
   this.config = config;
-  this.maxRecords = maxRecords;
   this.scan = scan;
 }
 
 @Override
-public ManagedReader newReader() {
-  return new HttpdLogBatchReader(config, maxRecords, scan);
+public ManagedReader newReader(FileSchemaNegotiator negotiator) throws EarlyEofException {

Review comment:
   To see this in action, take a look at 
`TestScanBasics.testEOFOnFirstOpen()`. If the exception is thrown, the scan 
framework skips this reader and moves to the next. This exception reports that 
"Hey, I have no data and no schema; please ignore me." Doesn't happen very 
often, but this is a safety-valve when it does happen.
   
   Suppose that the file (or result set) is empty and the constructor does not 
throw an error. In that case, the scan framework calls `next()` which gathers 
no rows, and returns `false`, which indicates EOF. If there is also no schema, 
then this case is the same as if the `EarlyEofException` was thrown.
   
   On the other hand, for readers that can provide a schema even without 
rows, the first `next()` can build that schema, so we can send it 
downstream even if there are no rows.
   
   With all of that, we can handle the various use cases:
   
   * Reader that has no data at all, and can't even get its act together enough 
to service a `next()` call: throw `EarlyEofException` from the constructor, and 
the reader is skipped. Example: a CSV file that existed at plan time, but is 
now gone.
   * Reader that has no data, but can provide a fixed schema. The constructor 
builds the schema and returns. The first call to `next()` returns `false`, with 
no data.
   * Reader that has no data, and can provide a schema without it, but only by 
retrieving results from somewhere else, such as a JDBC connection that can 
return a schema even if there are no rows. The constructor does nothing with 
schema. The first `read()` builds the schema, but provides no rows. That 
`next()` returns `false` so we send the schema, but no data, downstream.
   * Normal case: the reader either has a fixed schema in the constructor, or 
discovers the schema on the first `next()`, and also reads data until EOF, 
loading the data into batches, one per `next()` call.
   
   Sorry this is so complex! But, this is all essential to fully support all 
the many crazy readers and the "schema-on-read, but only sometimes" world in 
which Drill operates.
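The skip-on-construction protocol can be sketched as a self-contained simulation. To be clear, the class names below (`ScanSketch`, `Reader`, `OneBatchReader`, and so on) are illustrative stand-ins, not Drill's actual EVF classes; this only models the "throw from the constructor to be skipped" contract described above:

```java
import java.util.List;

public class ScanSketch {
  // Thrown by a reader's constructor when it has neither data nor schema.
  static class EarlyEofException extends Exception { }

  interface Reader {
    boolean next();  // false means EOF
  }

  interface ReaderFactory {
    Reader newReader() throws EarlyEofException;
  }

  // A reader whose backing file vanished between plan time and run time:
  // it cannot even service a next() call, so it asks to be skipped.
  static class MissingFileReader implements Reader {
    MissingFileReader() throws EarlyEofException {
      throw new EarlyEofException();
    }
    @Override public boolean next() { return false; }
  }

  // A normal reader that delivers a single batch and then reports EOF.
  static class OneBatchReader implements Reader {
    private boolean delivered;
    @Override public boolean next() {
      if (delivered) {
        return false;
      }
      delivered = true;
      return true;
    }
  }

  // The scan loop: construct each reader, skip any that throw
  // EarlyEofException, and drain the survivors until EOF.
  static int runScan(List<ReaderFactory> factories) {
    int batches = 0;
    for (ReaderFactory factory : factories) {
      Reader reader;
      try {
        reader = factory.newReader();
      } catch (EarlyEofException e) {
        continue;  // no data, no schema: ignore this reader
      }
      while (reader.next()) {
        batches++;
      }
    }
    return batches;
  }
}
```

Running a scan over one missing file and one normal reader yields a single batch: the missing reader is skipped silently rather than failing the query.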

##
File path: contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogBatchReader.java
##
@@ -40,36 +41,29 @@
 import java.io.InputStream;
 import java.io.InputStreamReader;
 
-public class HttpdLogBatchReader implements ManagedReader {
+public class HttpdLogBatchReader implements ManagedReader {
 
   private static final Logger logger = LoggerFactory.getLogger(HttpdLogBatchReader.class);
   public static final String RAW_LINE_COL_NAME = "_raw";
   public static final String MATCHED_COL_NAME = "_matched";
   private final HttpdLogFormatConfig formatConfig;
-  private final int maxRecords;
-  private final EasySubScan scan;
-  private HttpdParser parser;
-  private FileSplit split;
+  private final HttpdParser parser;
+  private final FileDescrip file;
   private InputStream fsStream;
-  private RowSetLoader rowWriter;
+  private final RowSetLoader rowWriter;
   private BufferedReader reader;
   private int lineNumber;
-  private CustomErrorContext errorContext;
-  private ScalarWriter rawLineWriter;
-  private ScalarWriter matchedWriter;
+  private final CustomErrorContext errorContext;
+  private final ScalarWriter rawLineWriter;
+  private final ScalarWriter matchedWriter;
   private int errorCount;
 
-
-  public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, int maxRecords, EasySubScan scan) {
+  public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
 this.formatConfig = formatConfig;
-this.maxRecords = maxRecords;
-this.scan = scan;
-  }
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {

Review comment:
   There is a more fundamental consideration. Drill is distributed: we want 
to hold resources for as short a time as possible. In the previous design, 
developers had to know not to open files or obtain resources in 

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Paul Rogers
Hi Ted,

Thanks for the explanation, makes sense.

Ideally, the client side would be somewhat agnostic about the repo it pulls
from. In a corporate setting, it should pull from the "JFrog Repository"
that everyone seems to use (but about which I know basically nothing). Oh, lord,
a plugin architecture for the repo for the plugin architecture?

- Paul

On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning  wrote:

>
> Paul,
>
> I understood your suggestion.  My point is that publishing to Maven
> central is a bit of a pain while publishing by posting to Github is nearly
> painless.  In particular, because Github inherently produces a relatively
> difficult to fake hash for each commit, referring to a dependency using
> that hash is relatively safe which saves a lot of agony regarding keys and
> trust.
>
> Further, Github or any comparable service provides the same "already
> exists" benefit as does Maven.
>
>
>
> On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers  wrote:
>
>> Hi Ted,
>>
>> Well said. Just to be clear, I wasn't suggesting that we use
>> Maven-the-build-tool to distribute plugins. Rather, I was simply observing
>> that building a global repo is a bit of a project and asked, "what could we
>> use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
>> Linux repos? Maybe. Maven's repo? Why not?
>>
>> The idea would be that Drill might have a tool that says, "install the
>> FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
>> the plugin in the proper plugins directory. In a cluster, either it does
>> that on every node, or the work is done as part of preparing a Docker
>> container which is then pushed to every node.
>>
>> The key thought is just to make the problem simpler by avoiding the need
>> to create and maintain a Drill-specific repo when we can barely have enough
>> resources to keep Drill itself afloat.
>>
>> None of this can happen, however, unless we clean up the plugin APIs and
>> ensure plugins can be built outside of the Drill repo. (That means, say,
>> that Drill needs an API library that resides in Maven.)
>>
>> There are probably many ways this has been done. Anyone know of any good
>> examples we can learn from?
>>
>> Thanks,
>>
>> - Paul
>>
>>
>> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning 
>> wrote:
>>
>>>
>>> I don't think that Maven is a forced move just because Drill is in Java.
>>> It may be a good move, but it isn't a forgone conclusion. For one thing,
>>> the conventions that Maven uses are pretty hard-wired and it may be
>>> difficult to have a reliable deny-list of known problematic plugins.
>>> Publishing to Maven is more of a pain than simply pushing to github.
>>>
>>> The usability here is paramount both for the ultimate Drill user, but
>>> also for the writer of plugins.
>>>
>>>
>>>
>>> On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:
>>>
 Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
 is probably better fit than GitHub for distribution?  If Drillbits can
 write to their jars/3rdparty directory then I can imagine Drill gaining
 the ability to fetch and install plugins itself without too much
 trouble, at least for Drill clusters with Internet access.
 "Sideloading" by downloading from Maven and copying manually would
 always remain possible.

 @Paul I'll try to get a little time with you to get some ideas about
 designing a plugin API.

 On 2022/01/14 23:20, Paul Rogers wrote:
 > Hi All,
 >
 > James raises an important issue, I've noticed that it used to be easy
 to
 > build and test Drill, now it is a struggle, because of the many odd
 > external dependencies we have introduced. That acts as a big damper on
 > contributions: none of us get paid enough to spend more time fighting
 > builds than developing the code...
 >
 > Ted is right that we need a good way to install plugins. There are two
 > parts. Ted is talking about the high-level part: make it easy to
 point to
 > some repo and use the plugin. Since Drill is Java, the Maven repo
 could be
 > a good mechanism. In-house stuff is often in an internal repo that
 does
 > whatever Maven needs.
 >
 > The reason that plugins are in the Drill project now is that Drill's
 "API"
 > is all of Drill. Plugins can (and some do) access all of Drill though
 the
 > fragment context. The API to Calcite and other parts of Drill are
 wide, and
 > tend to be tightly coupled with Drill internals. By contrast, other
 tools,
 > such as Presto/Trino, have defined very clean APIs that extensions
 use. In
 > Druid, everything is integrated via Google Guice and an extension can
 > replace any part of Druid (though, I'm not convinced that's actually
 a good
 > idea.) I'm sure there are others we can learn from.
 >
 > So, we need to define a plugin API for Drill. I started down that
 route a
 > 

[GitHub] [drill] paul-rogers commented on issue #2421: [DISCUSSION] ValueVectors Replacement

2022-01-17 Thread GitBox


paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1014927069


   @Leon-WTF, you ask a good question. I'm afraid the answer is rather complex: 
hard to describe in a few words. Read on if you're interested.
   
   Let's take a simple query. We have 10 scan fragments feeding 10 sort 
fragments that feed a single merge/"screen" fragment. Drill uses a random 
exchange to shuffle data from the 10 scan fragments to the 10 sorts, so that 
every sort gets data from every reader, balancing the load. There is an ordered 
exchange from the sorts to the merge. In "Impala notation":
   
   ```text
   Screen
   |
   Merge
   |
   Ordered Receiver
   |   -   -   -   -   -   -   -   -
   Ordered Sender
   |
   Sort
   |
   Unordered Receiver
   |  -   -   -   -   -   -   -   -
   Random Sender
   |
   Scan 
   ```
   
   The dotted lines are an ad-hoc addition to represent fragment boundaries. 
Remember there are 10 each of the lower two fragments, 1 of the top.
   
   First, let's consider the in-memory case. The scans read data and forward 
batches to random sorts. Each sort sorts every incoming batch and buffers it. 
When the sort sees EOF from all its inputs, it merges the buffered batches and 
sends the data downstream to the merge, one batch at a time.
   
   The merge needs a batch from each sort to proceed, so the merge can't start 
until the last sort finishes. (This is why a Sort is called a "blocking 
operator" or "buffering operator": it won't produce its first result until it 
has consumed all its input batches.)
   
   Everything is in memory, so each Sort can consume batches about as fast as 
the scans can produce them. The sorts all work in parallel, as we'd hope. The 
merge kicks in last, but that's to be expected: one can't merge without having 
first sorted all the data.
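The merge step at the top is an ordinary k-way merge over sorted inputs. A minimal, Drill-independent sketch (illustrative names, not Drill code) makes the blocking behavior concrete: the heap must be seeded with one element from every input before the first value can be emitted, which is the "merge needs a batch from each sort" constraint in miniature:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
  // One input stream's current head value plus the rest of its iterator.
  private static class Head {
    final int value;
    final Iterator<Integer> rest;
    Head(int value, Iterator<Integer> rest) {
      this.value = value;
      this.rest = rest;
    }
  }

  // Merge k sorted inputs into one sorted output.
  static List<Integer> merge(List<List<Integer>> sortedInputs) {
    PriorityQueue<Head> heap =
        new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
    // Seed the heap: one element from every non-empty input. In Drill terms,
    // the merge cannot emit anything until every sort has delivered a batch.
    for (List<Integer> input : sortedInputs) {
      Iterator<Integer> it = input.iterator();
      if (it.hasNext()) {
        heap.add(new Head(it.next(), it));
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Head head = heap.poll();          // smallest head across all inputs
      out.add(head.value);
      if (head.rest.hasNext()) {
        heap.add(new Head(head.rest.next(), head.rest));  // refill from same input
      }
    }
    return out;
  }
}
```

For example, merging the sorted inputs `[1, 4]`, `[2, 3]`, and `[5]` yields `[1, 2, 3, 4, 5]`.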
   
   OK, now for the spilling case, where the problems seem to crop up. (Full 
disclosure: I rewrote the current version of the Sort spilling code.) Now, data 
is too large to fit into memory for the sort. Let's focus on one sort, call it 
Sort 1. Sort 1 reads input batches until it fills its buffer. Then, Sort 1 
pauses to write its sorted batches to disk. Simple enough. But, here is where 
things get tricky.
   
   All the scans must fill their output batches. Because we're doing random 
exchanges to the sorts, all outgoing batches are about the same size. Let's now 
pick one scan to focus on, Scan 2. Scan 2 fills up the outgoing batch for Sort 
1, but Sort 1 is busy spilling, so Sort 1 can't read that batch. As a result, 
Scan 2 is blocked: it needs to add one more row for Sort 1, but it can't 
because the outgoing batch is full, and the downstream Sort won't accept it. 
So, Scan 2 grinds to a halt.
   
   The same is true for all other scans: as soon as they want to send to Sort 1 
(which is spilling), they block. Soon, all scans are blocked. This means that 
all the other Sorts stop working: they have no incoming batches, so they are 
starved.
   
   Eventually, Sort 1 completes its spill and starts reading again. This means 
Scan 2 can send and start working again. The same is true of the other Scans. 
Now, the next Sort, Sort 2, needs to spill. (Remember, the data is randomly 
distributed, so all Sorts see about the same number of rows.) So, the whole 
show occurs again. The scans can't send to Sort 2, so they stall. The other 
scans become starved for inputs, and so they stop.
   
   Basically, the entire system becomes serialized on the sort spills: 
effectively, across the cluster only one spill will be active at any time, and 
that spill blocks senders which blocks the other sorts. We have a 10-way 
distributed, single-threaded query.
   
   Now, I've omitted some details. One is that, in Drill, every receiver is 
obligated to buffer three incoming batches before it blocks. So, the scenario 
above is not exactly right: there is buffering in the Unordered Receiver below 
each Sort. But, the net effect is the same once that three-batch buffer is 
filled.
   
   Even here there is an issue: remember, each scan has to buffer a 1024-row 
batch for every sort. We have 10 scans and 10 sorts, so we're buffering 100 batches or 
100K rows. And, the sorts have buffers of 3 batches each, so that's another 30 
batches total, or 30K rows. All that consumes memory which is not available for 
the sort, hence the sort has to spill. That is, the memory design for Drill 
takes a narrow, per-operator view, and does not optimize memory use across the 
whole query.
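That arithmetic can be spelled out (the 1024 rows-per-batch figure is inferred from the round 100K/30K numbers above, not a stated Drill configuration value):

```java
public class BufferingMath {
  // Rows pinned in scan-side outgoing batches: one batch per (scan, sort) pair.
  static int senderRows(int scans, int sorts, int rowsPerBatch) {
    return scans * sorts * rowsPerBatch;
  }

  // Rows pinned in the receivers' buffers: each sort's receiver holds
  // up to bufferedBatches incoming batches before it blocks.
  static int receiverRows(int sorts, int bufferedBatches, int rowsPerBatch) {
    return sorts * bufferedBatches * rowsPerBatch;
  }

  public static void main(String[] args) {
    System.out.println(senderRows(10, 10, 1024));   // 102400, i.e. ~100K rows
    System.out.println(receiverRows(10, 3, 1024));  // 30720, i.e. ~30K rows
  }
}
```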
   
   All of the above is conjecture based on watching large queries grind to a 
crawl. Drill provides overall metrics, but not metrics broken down by time 
slices, so we have no good way to visualize behavior. Using ASCII graphics, 
here's what we'd expect to see:
   
   ```text
   Sort 1: rrrsssr___r___rsss_
   Sort 2: rr___rsss__r___r___rsss
   
   Scan 1: 

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Ted Dunning
Paul,

I understood your suggestion.  My point is that publishing to Maven central
is a bit of a pain while publishing by posting to Github is nearly
painless.  In particular, because Github inherently produces a relatively
difficult to fake hash for each commit, referring to a dependency using
that hash is relatively safe which saves a lot of agony regarding keys and
trust.

Further, Github or any comparable service provides the same "already
exists" benefit as does Maven.



On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers  wrote:

> Hi Ted,
>
> Well said. Just to be clear, I wasn't suggesting that we use
> Maven-the-build-tool to distribute plugins. Rather, I was simply observing
> that building a global repo is a bit of a project and asked, "what could we
> use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
> Linux repos? Maybe. Maven's repo? Why not?
>
> The idea would be that Drill might have a tool that says, "install the
> FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
> the plugin in the proper plugins directory. In a cluster, either it does
> that on every node, or the work is done as part of preparing a Docker
> container which is then pushed to every node.
>
> The key thought is just to make the problem simpler by avoiding the need
> to create and maintain a Drill-specific repo when we can barely have enough
> resources to keep Drill itself afloat.
>
> None of this can happen, however, unless we clean up the plugin APIs and
> ensure plugins can be built outside of the Drill repo. (That means, say,
> that Drill needs an API library that resides in Maven.)
>
> There are probably many ways this has been done. Anyone know of any good
> examples we can learn from?
>
> Thanks,
>
> - Paul
>
>
> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning  wrote:
>
>>
>> I don't think that Maven is a forced move just because Drill is in Java.
>> It may be a good move, but it isn't a forgone conclusion. For one thing,
>> the conventions that Maven uses are pretty hard-wired and it may be
>> difficult to have a reliable deny-list of known problematic plugins.
>> Publishing to Maven is more of a pain than simply pushing to github.
>>
>> The usability here is paramount both for the ultimate Drill user, but
>> also for the writer of plugins.
>>
>>
>>
>> On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:
>>
>>> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
>>> is probably better fit than GitHub for distribution?  If Drillbits can
>>> write to their jars/3rdparty directory then I can imagine Drill gaining
>>> the ability to fetch and install plugins itself without too much
>>> trouble, at least for Drill clusters with Internet access.
>>> "Sideloading" by downloading from Maven and copying manually would
>>> always remain possible.
>>>
>>> @Paul I'll try to get a little time with you to get some ideas about
>>> designing a plugin API.
>>>
>>> On 2022/01/14 23:20, Paul Rogers wrote:
>>> > Hi All,
>>> >
>>> > James raises an important issue, I've noticed that it used to be easy
>>> to
>>> > build and test Drill, now it is a struggle, because of the many odd
>>> > external dependencies we have introduced. That acts as a big damper on
>>> > contributions: none of us get paid enough to spend more time fighting
>>> > builds than developing the code...
>>> >
>>> > Ted is right that we need a good way to install plugins. There are two
>>> > parts. Ted is talking about the high-level part: make it easy to point
>>> to
>>> > some repo and use the plugin. Since Drill is Java, the Maven repo
>>> could be
>>> > a good mechanism. In-house stuff is often in an internal repo that does
>>> > whatever Maven needs.
>>> >
>>> > The reason that plugins are in the Drill project now is that Drill's
>>> "API"
>>> > is all of Drill. Plugins can (and some do) access all of Drill though
>>> the
>>> > fragment context. The API to Calcite and other parts of Drill are
>>> wide, and
>>> > tend to be tightly coupled with Drill internals. By contrast, other
>>> tools,
>>> > such as Presto/Trino, have defined very clean APIs that extensions
>>> use. In
>>> > Druid, everything is integrated via Google Guice and an extension can
>>> > replace any part of Druid (though, I'm not convinced that's actually a
>>> good
>>> > idea.) I'm sure there are others we can learn from.
>>> >
>>> > So, we need to define a plugin API for Drill. I started down that
>>> route a
>>> > while back: the first step was to refactor the plugin registry so it is
>>> > ready for extensions. The idea was to use the same mechanism for all
>>> kinds
>>> > of extensions (security, UDFs, metastore, etc.) The next step was to
>>> build
>>> > something that roughly followed Presto, but that kind of stalled out.
>>> >
>>> > In terms of ordering, we'd first need to define the plugin API. Then,
>>> we
>>> > can shift plugins to use that. Once that is done, we can move plugins
>>> to
>>> > separate projects. (The metastore 

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Paul Rogers
Hi Ted,

Well said. Just to be clear, I wasn't suggesting that we use
Maven-the-build-tool to distribute plugins. Rather, I was simply observing
that building a global repo is a bit of a project and asked, "what could we
use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
Linux repos? Maybe. Maven's repo? Why not?

The idea would be that Drill might have a tool that says, "install the
FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
the plugin in the proper plugins directory. In a cluster, either it does
that on every node, or the work is done as part of preparing a Docker
container which is then pushed to every node.

The key thought is just to make the problem simpler by avoiding the need to
create and maintain a Drill-specific repo when we can barely have enough
resources to keep Drill itself afloat.

None of this can happen, however, unless we clean up the plugin APIs and
ensure plugins can be built outside of the Drill repo. (That means, say,
that Drill needs an API library that resides in Maven.)

There are probably many ways this has been done. Anyone know of any good
examples we can learn from?

Thanks,

- Paul


On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning  wrote:

>
> I don't think that Maven is a forced move just because Drill is in Java.
> It may be a good move, but it isn't a forgone conclusion. For one thing,
> the conventions that Maven uses are pretty hard-wired and it may be
> difficult to have a reliable deny-list of known problematic plugins.
> Publishing to Maven is more of a pain than simply pushing to github.
>
> The usability here is paramount both for the ultimate Drill user, but also
> for the writer of plugins.
>
>
>
> On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:
>
>> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
>> is probably better fit than GitHub for distribution?  If Drillbits can
>> write to their jars/3rdparty directory then I can imagine Drill gaining
>> the ability to fetch and install plugins itself without too much
>> trouble, at least for Drill clusters with Internet access.
>> "Sideloading" by downloading from Maven and copying manually would
>> always remain possible.
>>
>> @Paul I'll try to get a little time with you to get some ideas about
>> designing a plugin API.
>>
>> On 2022/01/14 23:20, Paul Rogers wrote:
>> > Hi All,
>> >
>> > James raises an important issue, I've noticed that it used to be easy to
>> > build and test Drill, now it is a struggle, because of the many odd
>> > external dependencies we have introduced. That acts as a big damper on
>> > contributions: none of us get paid enough to spend more time fighting
>> > builds than developing the code...
>> >
>> > Ted is right that we need a good way to install plugins. There are two
>> > parts. Ted is talking about the high-level part: make it easy to point
>> to
>> > some repo and use the plugin. Since Drill is Java, the Maven repo could
>> be
>> > a good mechanism. In-house stuff is often in an internal repo that does
>> > whatever Maven needs.
>> >
>> > The reason that plugins are in the Drill project now is that Drill's
>> "API"
>> > is all of Drill. Plugins can (and some do) access all of Drill though
>> the
>> > fragment context. The API to Calcite and other parts of Drill are wide,
>> and
>> > tend to be tightly coupled with Drill internals. By contrast, other
>> tools,
>> > such as Presto/Trino, have defined very clean APIs that extensions use.
>> In
>> > Druid, everything is integrated via Google Guice and an extension can
>> > replace any part of Druid (though, I'm not convinced that's actually a
>> good
>> > idea.) I'm sure there are others we can learn from.
>> >
>> > So, we need to define a plugin API for Drill. I started down that route
>> a
>> > while back: the first step was to refactor the plugin registry so it is
>> > ready for extensions. The idea was to use the same mechanism for all
>> kinds
>> > of extensions (security, UDFs, metastore, etc.) The next step was to
>> build
>> > something that roughly followed Presto, but that kind of stalled out.
>> >
>> > In terms of ordering, we'd first need to define the plugin API. Then, we
>> > can shift plugins to use that. Once that is done, we can move plugins to
>> > separate projects. (The metastore implementation can also move, if we
>> > want.) Finally, figure out a solution for Ted's suggestion to make it
>> easy
>> > to grab new extensions. Drill is distributed, so adding a new plugin
>> has to
>> > happen on all nodes, which is a bit more complex than the typical
>> > Julia/Python/R kind of extension.
>> >
>> > The reason we're where we're at is that it is the path of least
>> resistance.
>> > Creating a good extension mechanism is hard, but valuable, as Ted noted.
>> >
>> > Thanks,
>> >
>> > - Paul
>> >
>> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning
>> wrote:
>> >
>> >> The bigger reason for a separate plug-in world is the enhancement of
>> >> 

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Ted Dunning
I don't think that Maven is a forced move just because Drill is in Java. It
may be a good move, but it isn't a foregone conclusion. For one thing, the
conventions that Maven uses are pretty hard-wired and it may be difficult
to have a reliable deny-list of known problematic plugins. Publishing to
Maven is more of a pain than simply pushing to github.

The usability here is paramount both for the ultimate Drill user, but also
for the writer of plugins.



On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:

> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
> is probably better fit than GitHub for distribution?  If Drillbits can
> write to their jars/3rdparty directory then I can imagine Drill gaining
> the ability to fetch and install plugins itself without too much
> trouble, at least for Drill clusters with Internet access.
> "Sideloading" by downloading from Maven and copying manually would
> always remain possible.
>
> @Paul I'll try to get a little time with you to get some ideas about
> designing a plugin API.
>
> On 2022/01/14 23:20, Paul Rogers wrote:
> > Hi All,
> >
> > James raises an important issue, I've noticed that it used to be easy to
> > build and test Drill, now it is a struggle, because of the many odd
> > external dependencies we have introduced. That acts as a big damper on
> > contributions: none of us get paid enough to spend more time fighting
> > builds than developing the code...
> >
> > Ted is right that we need a good way to install plugins. There are two
> > parts. Ted is talking about the high-level part: make it easy to point to
> > some repo and use the plugin. Since Drill is Java, the Maven repo could
> be
> > a good mechanism. In-house stuff is often in an internal repo that does
> > whatever Maven needs.
> >
> > The reason that plugins are in the Drill project now is that Drill's
> "API"
> > is all of Drill. Plugins can (and some do) access all of Drill though the
> > fragment context. The API to Calcite and other parts of Drill are wide,
> and
> > tend to be tightly coupled with Drill internals. By contrast, other
> tools,
> > such as Presto/Trino, have defined very clean APIs that extensions use.
> In
> > Druid, everything is integrated via Google Guice and an extension can
> > replace any part of Druid (though, I'm not convinced that's actually a
> good
> > idea.) I'm sure there are others we can learn from.
> >
> > So, we need to define a plugin API for Drill. I started down that route a
> > while back: the first step was to refactor the plugin registry so it is
> > ready for extensions. The idea was to use the same mechanism for all
> kinds
> > of extensions (security, UDFs, metastore, etc.) The next step was to
> build
> > something that roughly followed Presto, but that kind of stalled out.
> >
> > In terms of ordering, we'd first need to define the plugin API. Then, we
> > can shift plugins to use that. Once that is done, we can move plugins to
> > separate projects. (The metastore implementation can also move, if we
> > want.) Finally, figure out a solution for Ted's suggestion to make it
> easy
> > to grab new extensions. Drill is distributed, so adding a new plugin has
> to
> > happen on all nodes, which is a bit more complex than the typical
> > Julia/Python/R kind of extension.
> >
> > The reason we're where we're at is that it is the path of least
> resistance.
> > Creating a good extension mechanism is hard, but valuable, as Ted noted.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning
> wrote:
> >
> >> The bigger reason for a separate plug-in world is the enhancement of
> >> community.
> >>
> >> I would recommend looking at the Julia community for examples of
> >> effective ways to drive plug in structure.
> >>
> >> At the core, for any pure julia package, you can simply add a package by
> >> referring to the github repository where the package is stored. For
> >> packages that are "registered" (i.e. a path and a checksum is recorded
> in a
> >> well known data store), you can add a package by simply naming it
> without
> >> knowing the path.  All such plugins are tested by the authors and the
> >> project records all dependencies with version constraints so that
> cascading
> >> additions are easy. The community leaders have made tooling available so
> >> that you can test your package against a range of versions of Julia by
> >> pretty simple (to use) Github actions.
> >>
> >> The result has been an absolute explosion in the number of pure Julia
> >> packages.
> >>
> >> For packages that include C or Fortran (or whatever) code, there is some
> >> amazing tooling available that lets you record a build process on any of
> >> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD,
> OSX
> >> and so on). WHen you register such a package, it is automagically built
> on
> >> all the platforms you indicate and the binary results are checked into a
> >> central repository known as Yggdrasil.

[GitHub] [drill] jnturton commented on pull request #2416: DRILL-8094: Support reverse truncation for split_part udf

2022-01-17 Thread GitBox


jnturton commented on pull request #2416:
URL: https://github.com/apache/drill/pull/2416#issuecomment-1014689249


   P.S. Had Google implemented a lazy Splitter that operates backwards from the 
end of the string there might have been a nice simplification.






[GitHub] [drill] jnturton merged pull request #2416: DRILL-8094: Support reverse truncation for split_part udf

2022-01-17 Thread GitBox


jnturton merged pull request #2416:
URL: https://github.com/apache/drill/pull/2416


   






[GitHub] [drill] jnturton commented on a change in pull request #2416: DRILL-8094: Support reverse truncation for split_part udf

2022-01-17 Thread GitBox


jnturton commented on a change in pull request #2416:
URL: https://github.com/apache/drill/pull/2416#discussion_r776967381



##
File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
##
@@ -416,16 +417,25 @@ public void setup() {
 
     @Override
     public void eval() {
-      if (index.value < 1) {
+      if (index.value == 0) {
         throw org.apache.drill.common.exceptions.UserException.functionError()
-          .message("Index in split_part must be positive, value provided was "
-            + index.value).build();
+          .message("Index in split_part can not be zero").build();
       }
       String inputString = org.apache.drill.exec.expr.fn.impl.
         StringFunctionHelpers.getStringFromVarCharHolder(in);
-      int arrayIndex = index.value - 1;
-      String result =
-          (String) com.google.common.collect.Iterables.get(splitter.split(inputString), arrayIndex, "");
+      String result = "";
+      if (index.value < 0) {
+        java.util.List<String> splits = splitter.splitToList(inputString);
+        int size = splits.size();
+        int arrayIndex = size + index.value;
+        if (arrayIndex >= 0) {

Review comment:
   For performance we try to avoid branching inside function `eval` methods 
whenever possible.  Can any of these `if` statements be removed?  E.g. by using 
modular arithmetic to calculate the wrapped array index.
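
   A minimal sketch of the modular-arithmetic idea suggested above (class and method names are mine, not Drill's).  Note that full modular wrapping maps *every* integer into range, so out-of-bounds indices would wrap around instead of yielding an empty result; it illustrates the technique rather than split_part's exact semantics:

```java
public class ModularIndexDemo {
  // Map any integer index onto [0, size) without branching on its sign.
  // Java's % can yield a negative remainder for a negative dividend, so
  // add size and reduce once more to land in [0, size).
  static int wrap(int index, int size) {
    return ((index % size) + size) % size;
  }

  public static void main(String[] args) {
    System.out.println(wrap(-1, 3)); // prints 2 (last element)
    System.out.println(wrap(-3, 3)); // prints 0 (first element)
    System.out.println(wrap(4, 3));  // prints 1 (wraps past the end)
  }
}
```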

##
File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
##
(diff hunk identical to the one quoted above)

Review comment:
   Okay that does actually appear to be consistent with what happens when 
arrayIndex is past the end of the array.

##
File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
##
(diff hunk identical to the one quoted above)

Review comment:
   Perhaps in the context of this sort of string processing, branching penalties are not very significant.








[GitHub] [drill] jnturton commented on pull request #2416: DRILL-8094: Support reverse truncation for split_part udf

2022-01-17 Thread GitBox


jnturton commented on pull request #2416:
URL: https://github.com/apache/drill/pull/2416#issuecomment-1014685897


   Hi @Leon-WTF, sorry about the long delay here.  I wanted to try to remove `if` statements, but hadn't noticed that lazy splitting is possible for positive index values, making the two cases more different than I'd realised.  I came up with an alternative implementation for the negative index case based on reversing the original string and the delimiter, doing lazy splitting in the forward direction, and then reversing the selected part to get the answer, but in the end I think it was worse than what's here.
   
   I'll approve shortly.
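
   The reversal-based alternative described above can be sketched roughly as follows (class and method names are mine; plain `String.split` stands in for Guava's lazy `Splitter`, so the laziness that motivated the idea is not actually demonstrated here):

```java
public class ReverseSplitDemo {
  static String reverse(String s) {
    return new StringBuilder(s).reverse().toString();
  }

  // Select the |negIndex|-th part from the end: reverse the input and the
  // delimiter, split in the ordinary forward direction, pick the part, then
  // reverse the selected part back.  negIndex must be negative (-1 = last).
  static String splitPartFromEnd(String input, String delim, int negIndex) {
    String[] parts = reverse(input)
        .split(java.util.regex.Pattern.quote(reverse(delim)));
    int i = -negIndex - 1; // 0-based position counted from the end
    return i < parts.length ? reverse(parts[i]) : "";
  }

  public static void main(String[] args) {
    System.out.println(splitPartFromEnd("a,b,c", ",", -1)); // prints c
    System.out.println(splitPartFromEnd("a,b,c", ",", -3)); // prints a
  }
}
```

   Note that `String.split` drops trailing empty strings, which is one reason a sketch like this can diverge from the merged implementation's semantics.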






[GitHub] [drill] jnturton commented on a change in pull request #2429: DRILL-8107: Hadoop2 backport Maven profile

2022-01-17 Thread GitBox


jnturton commented on a change in pull request #2429:
URL: https://github.com/apache/drill/pull/2429#discussion_r786005734



##
File path: exec/jdbc-all/pom.xml
##
@@ -874,6 +874,42 @@
 
   
 
+
+  hadoop-2
+  
+  
+
+  org.apache.maven.plugins
+  maven-enforcer-plugin
+  
+
+  enforce-jdbc-jar-compactness
+  
+enforce
+  
+  verify
+  
+
+  
+
+  The file drill-jdbc-all-${project.version}.jar is 
outside the expected size range.
+  This is likely due to you adding new dependencies to a 
java-exec and not updating the excludes in this module. This is important as it 
minimizes the size of the dependency of Drill application users.
+
+4760

Review comment:
   Sounds like something that we should make a property anyway.
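
   Turning the hard-coded limit into a Maven property might look something like this (the property name and exact placement are my assumptions; the XML tags stripped from the quoted diff make it unclear whether `4760` is the upper or lower bound of the enforcer's `requireFilesSize` rule):

```xml
<properties>
  <!-- hypothetical property a profile such as hadoop-2 could override -->
  <jdbc-all.jar.maxsize>4760</jdbc-all.jar.maxsize>
</properties>

<!-- inside maven-enforcer-plugin's <configuration><rules> -->
<requireFilesSize>
  <message>The file drill-jdbc-all-${project.version}.jar is outside the expected size range.</message>
  <maxsize>${jdbc-all.jar.maxsize}</maxsize>
  <files>
    <file>${project.build.directory}/drill-jdbc-all-${project.version}.jar</file>
  </files>
</requireFilesSize>
```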








[GitHub] [drill] jnturton commented on pull request #2429: DRILL-8107: Hadoop2 backport Maven profile

2022-01-17 Thread GitBox


jnturton commented on pull request #2429:
URL: https://github.com/apache/drill/pull/2429#issuecomment-1014548764


   @vdiravka am I right that our CI builds won't test the new hadoop-2 profile 
automatically?






[GitHub] [drill] jnturton edited a comment on pull request #2429: DRILL-8107: Hadoop2 backport Maven profile

2022-01-17 Thread GitBox


jnturton edited a comment on pull request #2429:
URL: https://github.com/apache/drill/pull/2429#issuecomment-1014342585


   @cgivre I've just added instructions for using this profile to https://drill.apache.org/docs/compiling-drill-from-source/.  Well, I thought I had.  Something's wrong with the website CI for now.






Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread James Turton
Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven 
is probably a better fit than GitHub for distribution?  If Drillbits can 
write to their jars/3rdparty directory then I can imagine Drill gaining 
the ability to fetch and install plugins itself without too much 
trouble, at least for Drill clusters with Internet access.  
"Sideloading" by downloading from Maven and copying manually would 
always remain possible.


@Paul I'll try to get a little time with you to get some ideas about 
designing a plugin API.


On 2022/01/14 23:20, Paul Rogers wrote:

Hi All,

James raises an important issue. I've noticed that it used to be easy to
build and test Drill; now it is a struggle because of the many odd
external dependencies we have introduced. That acts as a big damper on
contributions: none of us get paid enough to spend more time fighting
builds than developing the code...

Ted is right that we need a good way to install plugins. There are two
parts. Ted is talking about the high-level part: make it easy to point to
some repo and use the plugin. Since Drill is Java, the Maven repo could be
a good mechanism. In-house stuff is often in an internal repo that does
whatever Maven needs.

The reason that plugins are in the Drill project now is that Drill's "API"
is all of Drill. Plugins can (and some do) access all of Drill though the
fragment context. The API to Calcite and other parts of Drill are wide, and
tend to be tightly coupled with Drill internals. By contrast, other tools,
such as Presto/Trino, have defined very clean APIs that extensions use. In
Druid, everything is integrated via Google Guice and an extension can
replace any part of Druid (though, I'm not convinced that's actually a good
idea.) I'm sure there are others we can learn from.

So, we need to define a plugin API for Drill. I started down that route a
while back: the first step was to refactor the plugin registry so it is
ready for extensions. The idea was to use the same mechanism for all kinds
of extensions (security, UDFs, metastore, etc.) The next step was to build
something that roughly followed Presto, but that kind of stalled out.

In terms of ordering, we'd first need to define the plugin API. Then, we
can shift plugins to use that. Once that is done, we can move plugins to
separate projects. (The metastore implementation can also move, if we
want.) Finally, figure out a solution for Ted's suggestion to make it easy
to grab new extensions. Drill is distributed, so adding a new plugin has to
happen on all nodes, which is a bit more complex than the typical
Julia/Python/R kind of extension.

The reason we're where we're at is that it is the path of least resistance.
Creating a good extension mechanism is hard, but valuable, as Ted noted.

Thanks,

- Paul

On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning  wrote:


The bigger reason for a separate plug-in world is the enhancement of
community.

I would recommend looking at the Julia community for examples of
effective ways to drive plug-in structure.

At the core, for any pure julia package, you can simply add a package by
referring to the github repository where the package is stored. For
packages that are "registered" (i.e. a path and a checksum are recorded in a
well known data store), you can add a package by simply naming it without
knowing the path.  All such plugins are tested by the authors and the
project records all dependencies with version constraints so that cascading
additions are easy. The community leaders have made tooling available so
that you can test your package against a range of versions of Julia by
pretty simple (to use) Github actions.

The result has been an absolute explosion in the number of pure Julia
packages.

For packages that include C or Fortran (or whatever) code, there is some
amazing tooling available that lets you record a build process on any of
the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD, OSX
and so on). When you register such a package, it is automagically built on
all the platforms you indicate and the binary results are checked into a
central repository known as Yggdrasil.

All of these registration events for different packages are recorded in a
central registry as I mentioned. That registry is recorded in Github as
well which makes it easy to propagate changes.



On Thu, Jan 13, 2022 at 8:45 PM James Turton  wrote:


Hello dev community

Discussions about reorganising the Drill source code to better position
the project to support plug-ins for the "long tail" of weird and
wonderful systems and data formats have been coming up here and there
for a few months, e.g. in https://github.com/apache/drill/pull/2359.

A view which I personally share is that adding too large a number and
variety of plug-ins to the main tree would create a lethal maintenance
burden for developers working there and lead down a road of accumulating
technical debt.  The Maven tricks we must employ to harmonise the
growing set of 

[GitHub] [drill-site] jnturton merged pull request #22: DRILL-8094: Update doc for split_part

2022-01-17 Thread GitBox


jnturton merged pull request #22:
URL: https://github.com/apache/drill-site/pull/22


   






[GitHub] [drill] jnturton commented on pull request #2429: DRILL-8107: Hadoop2 backport Maven profile

2022-01-17 Thread GitBox


jnturton commented on pull request #2429:
URL: https://github.com/apache/drill/pull/2429#issuecomment-1014342585


   @cgivre I've just added instructions for using this profile to https://drill.apache.org/docs/compiling-drill-from-source/

