[GitHub] [drill] jnturton commented on pull request #1710: DRILL-7127: Updating hbase version for mapr profile
jnturton commented on pull request #1710: URL: https://github.com/apache/drill/pull/1710#issuecomment-1015148877 @Agirish can we close or reassign this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [drill] paul-rogers commented on a change in pull request #2419: DRILL-8085: EVF V2 support in the "Easy" format plugin
paul-rogers commented on a change in pull request #2419: URL: https://github.com/apache/drill/pull/2419#discussion_r786323997 ## File path: contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogFormatPlugin.java ## @@ -40,18 +40,16 @@ private static class HttpLogReaderFactory extends FileReaderFactory { private final HttpdLogFormatConfig config; -private final int maxRecords; private final EasySubScan scan; -private HttpLogReaderFactory(HttpdLogFormatConfig config, int maxRecords, EasySubScan scan) { +private HttpLogReaderFactory(HttpdLogFormatConfig config, EasySubScan scan) { this.config = config; - this.maxRecords = maxRecords; this.scan = scan; } @Override -public ManagedReader newReader() { - return new HttpdLogBatchReader(config, maxRecords, scan); +public ManagedReader newReader(FileSchemaNegotiator negotiator) throws EarlyEofException { Review comment: To see this in action, take a look at `TestScanBasics.testEOFOnFirstOpen()`. If the exception is thrown, the scan framework skips this reader and moves to the next. This exception reports that "Hey, I have no data and no schema; please ignore me." It doesn't happen very often, but this is a safety valve for when it does. Suppose that the file (or result set) is empty and the constructor does not throw an error. In that case, the scan framework calls `next()`, which gathers no rows and returns `false`, indicating EOF. If there is also no schema, then this case is the same as if the `EarlyEofException` was thrown. On the other hand, for things that can provide a schema even without rows, the first `next()` can build that schema, so we can send it downstream even if there are no rows. With all of that, we can handle the various use cases: * Reader that has no data at all, and can't even get its act together enough to service a `next()` call: throw `EarlyEofException` from the constructor, and the reader is skipped.
Example: a CSV file that existed at plan time, but is now gone. * Reader that has no data, but can provide a fixed schema. The constructor builds the schema and returns. The first call to `next()` returns `false`, with no data. * Reader that has no data, and can provide a schema without it, but only by retrieving results from somewhere else, such as a JDBC connection that can return a schema even if there are no rows. The constructor does nothing with schema. The first `next()` builds the schema, but provides no rows. That `next()` returns `false`, so we send the schema, but no data, downstream. * Normal case: the reader either has a fixed schema in the constructor, or discovers the schema on the first `next()`, and also reads data until EOF, loading the data into batches, one per `next()` call. Sorry this is so complex! But this is all essential to fully support all the many crazy readers and the "schema-on-read, but only sometimes" world in which Drill operates. ## File path: contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogBatchReader.java ## @@ -40,36 +41,29 @@ import java.io.InputStream; import java.io.InputStreamReader; -public class HttpdLogBatchReader implements ManagedReader { +public class HttpdLogBatchReader implements ManagedReader { private static final Logger logger = LoggerFactory.getLogger(HttpdLogBatchReader.class); public static final String RAW_LINE_COL_NAME = "_raw"; public static final String MATCHED_COL_NAME = "_matched"; private final HttpdLogFormatConfig formatConfig; - private final int maxRecords; - private final EasySubScan scan; - private HttpdParser parser; - private FileSplit split; + private final HttpdParser parser; + private final FileDescrip file; private InputStream fsStream; - private RowSetLoader rowWriter; + private final RowSetLoader rowWriter; private BufferedReader reader; private int lineNumber; - private CustomErrorContext errorContext; - private ScalarWriter rawLineWriter; - private 
ScalarWriter matchedWriter; + private final CustomErrorContext errorContext; + private final ScalarWriter rawLineWriter; + private final ScalarWriter matchedWriter; private int errorCount; - - public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, int maxRecords, EasySubScan scan) { + public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, EasySubScan scan, FileSchemaNegotiator negotiator) { this.formatConfig = formatConfig; -this.maxRecords = maxRecords; -this.scan = scan; - } - @Override - public boolean open(FileSchemaNegotiator negotiator) { Review comment: There is a more fundamental consideration. Drill is distributed: we want to hold resources as short a time as possible. In the previous design, developers had to understand to not open files or obtain resources in
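The reader-lifecycle cases described in the review comments above can be sketched with simplified stand-in types. This is not Drill code: the real EVF V2 interfaces (`ManagedReader`, `FileSchemaNegotiator`, `EarlyEofException`) live in Drill's scan framework, and the class and method names below are illustrative stand-ins only.

```java
import java.util.List;

// Stand-in for the EVF V2 exception that tells the scan framework
// "I have no data and no schema; please ignore me."
class EarlyEofException extends Exception {}

// Stand-in for the ManagedReader contract: next() loads one batch,
// returning false at EOF.
interface Reader {
  boolean next();
  void close();
}

// Case 1: the data source vanished between plan time and run time.
// The constructor throws, and the framework skips this reader.
class MissingFileReader implements Reader {
  MissingFileReader(boolean fileExists) throws EarlyEofException {
    if (!fileExists) {
      throw new EarlyEofException();
    }
  }
  @Override public boolean next() { return false; }
  @Override public void close() {}
}

// Case 2: no rows, but a schema is available up front. The constructor
// "builds" the schema; the first next() returns false, so only the
// schema goes downstream.
class EmptyWithSchemaReader implements Reader {
  final List<String> schema;
  EmptyWithSchemaReader(List<String> schema) {
    this.schema = schema;  // schema is defined even though no rows exist
  }
  @Override public boolean next() { return false; }
  @Override public void close() {}
}

public class ReaderLifecycleDemo {
  public static void main(String[] args) {
    try {
      new MissingFileReader(false);
      System.out.println("opened");
    } catch (EarlyEofException e) {
      System.out.println("skipped");  // framework moves to the next reader
    }
    Reader r = new EmptyWithSchemaReader(List.of("a", "b"));
    System.out.println(r.next());  // false: EOF with schema, no rows
  }
}
```

The point of the sketch is the division of labor: the constructor decides whether the reader can participate at all, while `next()` distinguishes "no rows but a schema" from "rows until EOF".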
Re: [DISCUSS] Drill 2 and plug-in organisation
Hi Ted, Thanks for the explanation, makes sense. Ideally, the client side would be somewhat agnostic about the repo it pulls from. In a corporate setting, it should pull from the "JFrog Repository" that everyone seems to use (but about which I know basically nothing). Oh, lord, a plugin architecture for the repo for the plugin architecture? - Paul On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning wrote: > > Paul, > > I understood your suggestion. My point is that publishing to Maven > central is a bit of a pain while publishing by posting to Github is nearly > painless. In particular, because Github inherently produces a relatively > difficult to fake hash for each commit, referring to a dependency using > that hash is relatively safe which saves a lot of agony regarding keys and > trust. > > Further, Github or any comparable service provides the same "already > exists" benefit as does Maven. > > > > On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers wrote: > >> Hi Ted, >> >> Well said. Just to be clear, I wasn't suggesting that we use >> Maven-the-build-tool to distribute plugins. Rather, I was simply observing >> that building a global repo is a bit of a project and asked, "what could we >> use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever >> Linux repos? Maybe. Maven's repo? Why not? >> >> The idea would be that Drill might have a tool that says, "install the >> FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts >> the plugin in the proper plugins directory. In a cluster, either it does >> that on every node, or the work is done as part of preparing a Docker >> container which is then pushed to every node. >> >> The key thought is just to make the problem simpler by avoiding the need >> to create and maintain a Drill-specific repo when we can barely have enough >> resources to keep Drill itself afloat. 
>> >> None of this can happen, however, unless we clean up the plugin APIs and >> ensure plugins can be built outside of the Drill repo. (That means, say, >> that Drill needs an API library that resides in Maven.) >> >> There are probably many ways this has been done. Anyone know of any good >> examples we can learn from? >> >> Thanks, >> >> - Paul >> >> >> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning >> wrote: >> >>> >>> I don't think that Maven is a forced move just because Drill is in Java. >>> It may be a good move, but it isn't a forgone conclusion. For one thing, >>> the conventions that Maven uses are pretty hard-wired and it may be >>> difficult to have a reliable deny-list of known problematic plugins. >>> Publishing to Maven is more of a pain than simply pushing to github. >>> >>> The usability here is paramount both for the ultimate Drill user, but >>> also for the writer of plugins. >>> >>> >>> >>> On Mon, Jan 17, 2022 at 5:06 AM James Turton wrote: >>> Thank you Ted and Paul for the feedback. Since Java is compiled, Maven is probably better fit than GitHub for distribution? If Drillbits can write to their jars/3rdparty directory then I can imagine Drill gaining the ability to fetch and install plugins itself without too much trouble, at least for Drill clusters with Internet access. "Sideloading" by downloading from Maven and copying manually would always remain possible. @Paul I'll try to get a little time with you to get some ideas about designing a plugin API. On 2022/01/14 23:20, Paul Rogers wrote: > Hi All, > > James raises an important issue, I've noticed that it used to be easy to > build and test Drill, now it is a struggle, because of the many odd > external dependencies we have introduced. That acts as a big damper on > contributions: none of us get paid enough to spend more time fighting > builds than developing the code... > > Ted is right that we need a good way to install plugins. There are two > parts. 
Ted is talking about the high-level part: make it easy to point to > some repo and use the plugin. Since Drill is Java, the Maven repo could be > a good mechanism. In-house stuff is often in an internal repo that does > whatever Maven needs. > > The reason that plugins are in the Drill project now is that Drill's "API" > is all of Drill. Plugins can (and some do) access all of Drill though the > fragment context. The API to Calcite and other parts of Drill are wide, and > tend to be tightly coupled with Drill internals. By contrast, other tools, > such as Presto/Trino, have defined very clean APIs that extensions use. In > Druid, everything is integrated via Google Guice and an extension can > replace any part of Druid (though, I'm not convinced that's actually a good > idea.) I'm sure there are others we can learn from. > > So, we need to define a plugin API for Drill. I started down that route a >
[GitHub] [drill] paul-rogers commented on issue #2421: [DISCUSSION] ValueVectors Replacement
paul-rogers commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1014927069 @Leon-WTF, you ask a good question. I'm afraid the answer is rather complex: hard to describe in a few words. Read on if you're interested. Let's take a simple query. We have 10 scan fragments feeding 10 sort fragments that feed a single merge/"screen" fragment. Drill uses a random exchange to shuffle data from the 10 scan fragments to the 10 sorts, so that every sort gets data from every reader, balancing the load. There is an ordered exchange from the sorts to the merge. In "Impala notation":

```text
Screen
  |
Merge
  |
Ordered Receiver
  |
- - - - - - - -
Ordered Sender
  |
Sort
  |
Unordered Receiver
  |
- - - - - - - -
Random Sender
  |
Scan
```

The dotted lines are an ad-hoc addition to represent fragment boundaries. Remember there are 10 each of the lower two fragments, 1 of the top. First, let's consider the in-memory case. The scans read data and forward batches to random sorts. Each sort sorts each incoming batch and buffers it. When the sort sees EOF from all its inputs, it merges the buffered batches and sends the data downstream to the merge, one batch at a time. The merge needs a batch from each sort to proceed, so the merge can't start until the last sort finishes. (This is why a Sort is called a "blocking operator" or "buffering operator": it won't produce its first result until it has consumed all its input batches.) Everything is in memory, so each Sort can consume batches about as fast as the scans can produce them. The sorts all work in parallel, as we'd hope. The merge kicks in last, but that's to be expected: one can't merge without having first sorted all the data. OK, now for the spilling case, where the problems seem to crop up. (Full disclosure: I rewrote the current version of the Sort spilling code.) Now, the data is too large to fit into memory for the sort. Let's focus on one sort, call it Sort 1. 
Sort 1 reads input batches until it fills its buffer. Then, Sort 1 pauses to write its sorted batches to disk. Simple enough. But, here is where things get tricky. All the scans must fill their output batches. Because we're doing random exchanges to the sorts, all outgoing batches are about the same size. Let's now pick one scan to focus on, Scan 2. Scan 2 fills up the outgoing batch for Sort 1, but Sort 1 is busy spilling, so Sort 1 can't read that batch. As a result, Scan 2 is blocked: it needs to add one more row for Sort 1, but it can't because the outgoing batch is full, and the downstream Sort won't accept it. So, Scan 2 grinds to a halt. The same is true for all other scans: as soon as they want to send to Sort 1 (which is spilling), they block. Soon, all scans are blocked. This means that all the other Sorts stop working: they have no incoming batches, so they are starved. Eventually, Sort 1 completes its spill and starts reading again. This means Scan 2 can send and start working again. The same is true of the other scans. Now, the next Sort, Sort 2, needs to spill. (Remember, the data is randomly distributed, so all Sorts see about the same number of rows.) So, the whole show occurs again. The scans can't send to Sort 2, so they stall. The other sorts become starved for inputs, and so they stop. Basically, the entire system becomes serialized on the sort spills: effectively, across the cluster only one spill will be active at any time, and that spill blocks senders, which blocks the other sorts. We have a 10-way distributed, single-threaded query. Now, I've omitted some details. One is that, in Drill, every receiver is obligated to buffer three incoming batches before it blocks. So, the scenario above is not exactly right: there is buffering in the Unordered Receiver below each Sort. But, the net effect is the same once that three-batch buffer is filled. Even here there is an issue: remember, each scan has to buffer a batch of 1024 rows for every sort. 
We have 10 scans and 10 sorts, so we're buffering 100 batches or 100K rows. And, the sorts have buffers of 3 batches each, so that's another 30 batches total, or 30K rows. All that consumes memory which is not available for the sort, hence the sort has to spill. That is, the memory design for Drill takes a narrow, per-operator view, and does not optimize memory use across the whole query. All of the above is conjecture based on watching large queries grind to a crawl. Drill provides overall metrics, but not metrics broken down by time slices, so we have no good way to visualize behavior. Using ASCII graphics, here's what we'd expect to see:

```text
Sort 1: rrrsssr___r___rsss_
Sort 2: rr___rsss__r___r___rsss
Scan 1:
```
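The stall described above can be reproduced with a toy model. This is not Drill code: it's a deterministic, single-threaded sketch of one scan feeding one sort through a bounded receiver buffer (the three-batch buffer mentioned in the message), showing that while the sort spills, the scan quickly blocks.

```java
import java.util.ArrayDeque;

// Toy model of the spill stall: a scan sends batches into a sort's
// bounded receiver buffer. While the sort is busy spilling it consumes
// nothing, so the buffer fills and the scan makes no further progress.
public class SpillStallDemo {

  // Run `steps` scheduler ticks; returns {batchesSent, ticksBlocked}.
  static int[] run(int steps, boolean sortSpilling, int bufferCap) {
    ArrayDeque<Integer> receiverBuffer = new ArrayDeque<>();
    int sent = 0;
    int blocked = 0;
    for (int t = 0; t < steps; t++) {
      if (!sortSpilling) {
        receiverBuffer.poll();            // sort consumes one batch
      }
      if (receiverBuffer.size() < bufferCap) {
        receiverBuffer.add(sent++);       // scan sends one batch
      } else {
        blocked++;                        // scan blocked: buffer is full
      }
    }
    return new int[] {sent, blocked};
  }

  public static void main(String[] args) {
    int[] spilling = run(10, true, 3);
    int[] healthy = run(10, false, 3);
    // While the sort spills, the scan sends 3 batches, then stalls for
    // the remaining 7 ticks.
    System.out.println("spilling: sent=" + spilling[0] + " blocked=" + spilling[1]);
    // With the sort draining, the scan never blocks.
    System.out.println("healthy:  sent=" + healthy[0] + " blocked=" + healthy[1]);
  }
}
```

In the real system the same dynamic plays out fan-in: every scan has an outgoing batch destined for the spilling sort, so one sort's spill eventually blocks all scans, which starves all the other sorts.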
Re: [DISCUSS] Drill 2 and plug-in organisation
Paul, I understood your suggestion. My point is that publishing to Maven central is a bit of a pain while publishing by posting to Github is nearly painless. In particular, because Github inherently produces a relatively difficult to fake hash for each commit, referring to a dependency using that hash is relatively safe which saves a lot of agony regarding keys and trust. Further, Github or any comparable service provides the same "already exists" benefit as does Maven. On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers wrote: > Hi Ted, > > Well said. Just to be clear, I wasn't suggesting that we use > Maven-the-build-tool to distribute plugins. Rather, I was simply observing > that building a global repo is a bit of a project and asked, "what could we > use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever > Linux repos? Maybe. Maven's repo? Why not? > > The idea would be that Drill might have a tool that says, "install the > FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts > the plugin in the proper plugins directory. In a cluster, either it does > that on every node, or the work is done as part of preparing a Docker > container which is then pushed to every node. > > The key thought is just to make the problem simpler by avoiding the need > to create and maintain a Drill-specific repo when we can barely have enough > resources to keep Drill itself afloat. > > None of this can happen, however, unless we clean up the plugin APIs and > ensure plugins can be built outside of the Drill repo. (That means, say, > that Drill needs an API library that resides in Maven.) > > There are probably many ways this has been done. Anyone know of any good > examples we can learn from? > > Thanks, > > - Paul > > > On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning wrote: > >> >> I don't think that Maven is a forced move just because Drill is in Java. >> It may be a good move, but it isn't a forgone conclusion. 
For one thing, >> the conventions that Maven uses are pretty hard-wired and it may be >> difficult to have a reliable deny-list of known problematic plugins. >> Publishing to Maven is more of a pain than simply pushing to github. >> >> The usability here is paramount both for the ultimate Drill user, but >> also for the writer of plugins. >> >> >> >> On Mon, Jan 17, 2022 at 5:06 AM James Turton wrote: >> >>> Thank you Ted and Paul for the feedback. Since Java is compiled, Maven >>> is probably better fit than GitHub for distribution? If Drillbits can >>> write to their jars/3rdparty directory then I can imagine Drill gaining >>> the ability to fetch and install plugins itself without too much >>> trouble, at least for Drill clusters with Internet access. >>> "Sideloading" by downloading from Maven and copying manually would >>> always remain possible. >>> >>> @Paul I'll try to get a little time with you to get some ideas about >>> designing a plugin API. >>> >>> On 2022/01/14 23:20, Paul Rogers wrote: >>> > Hi All, >>> > >>> > James raises an important issue, I've noticed that it used to be easy >>> to >>> > build and test Drill, now it is a struggle, because of the many odd >>> > external dependencies we have introduced. That acts as a big damper on >>> > contributions: none of us get paid enough to spend more time fighting >>> > builds than developing the code... >>> > >>> > Ted is right that we need a good way to install plugins. There are two >>> > parts. Ted is talking about the high-level part: make it easy to point >>> to >>> > some repo and use the plugin. Since Drill is Java, the Maven repo >>> could be >>> > a good mechanism. In-house stuff is often in an internal repo that does >>> > whatever Maven needs. >>> > >>> > The reason that plugins are in the Drill project now is that Drill's >>> "API" >>> > is all of Drill. Plugins can (and some do) access all of Drill though >>> the >>> > fragment context. 
The API to Calcite and other parts of Drill are >>> wide, and >>> > tend to be tightly coupled with Drill internals. By contrast, other >>> tools, >>> > such as Presto/Trino, have defined very clean APIs that extensions >>> use. In >>> > Druid, everything is integrated via Google Guice and an extension can >>> > replace any part of Druid (though, I'm not convinced that's actually a >>> good >>> > idea.) I'm sure there are others we can learn from. >>> > >>> > So, we need to define a plugin API for Drill. I started down that >>> route a >>> > while back: the first step was to refactor the plugin registry so it is >>> > ready for extensions. The idea was to use the same mechanism for all >>> kinds >>> > of extensions (security, UDFs, metastore, etc.) The next step was to >>> build >>> > something that roughly followed Presto, but that kind of stalled out. >>> > >>> > In terms of ordering, we'd first need to define the plugin API. Then, >>> we >>> > can shift plugins to use that. Once that is done, we can move plugins >>> to >>> > separate projects. (The metastore
Re: [DISCUSS] Drill 2 and plug-in organisation
Hi Ted, Well said. Just to be clear, I wasn't suggesting that we use Maven-the-build-tool to distribute plugins. Rather, I was simply observing that building a global repo is a bit of a project and asked, "what could we use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever Linux repos? Maybe. Maven's repo? Why not? The idea would be that Drill might have a tool that says, "install the FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts the plugin in the proper plugins directory. In a cluster, either it does that on every node, or the work is done as part of preparing a Docker container which is then pushed to every node. The key thought is just to make the problem simpler by avoiding the need to create and maintain a Drill-specific repo when we can barely have enough resources to keep Drill itself afloat. None of this can happen, however, unless we clean up the plugin APIs and ensure plugins can be built outside of the Drill repo. (That means, say, that Drill needs an API library that resides in Maven.) There are probably many ways this has been done. Anyone know of any good examples we can learn from? Thanks, - Paul On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning wrote: > > I don't think that Maven is a forced move just because Drill is in Java. > It may be a good move, but it isn't a forgone conclusion. For one thing, > the conventions that Maven uses are pretty hard-wired and it may be > difficult to have a reliable deny-list of known problematic plugins. > Publishing to Maven is more of a pain than simply pushing to github. > > The usability here is paramount both for the ultimate Drill user, but also > for the writer of plugins. > > > > On Mon, Jan 17, 2022 at 5:06 AM James Turton wrote: > >> Thank you Ted and Paul for the feedback. Since Java is compiled, Maven >> is probably better fit than GitHub for distribution? 
If Drillbits can >> write to their jars/3rdparty directory then I can imagine Drill gaining >> the ability to fetch and install plugins itself without too much >> trouble, at least for Drill clusters with Internet access. >> "Sideloading" by downloading from Maven and copying manually would >> always remain possible. >> >> @Paul I'll try to get a little time with you to get some ideas about >> designing a plugin API. >> >> On 2022/01/14 23:20, Paul Rogers wrote: >> > Hi All, >> > >> > James raises an important issue, I've noticed that it used to be easy to >> > build and test Drill, now it is a struggle, because of the many odd >> > external dependencies we have introduced. That acts as a big damper on >> > contributions: none of us get paid enough to spend more time fighting >> > builds than developing the code... >> > >> > Ted is right that we need a good way to install plugins. There are two >> > parts. Ted is talking about the high-level part: make it easy to point >> to >> > some repo and use the plugin. Since Drill is Java, the Maven repo could >> be >> > a good mechanism. In-house stuff is often in an internal repo that does >> > whatever Maven needs. >> > >> > The reason that plugins are in the Drill project now is that Drill's >> "API" >> > is all of Drill. Plugins can (and some do) access all of Drill though >> the >> > fragment context. The API to Calcite and other parts of Drill are wide, >> and >> > tend to be tightly coupled with Drill internals. By contrast, other >> tools, >> > such as Presto/Trino, have defined very clean APIs that extensions use. >> In >> > Druid, everything is integrated via Google Guice and an extension can >> > replace any part of Druid (though, I'm not convinced that's actually a >> good >> > idea.) I'm sure there are others we can learn from. >> > >> > So, we need to define a plugin API for Drill. 
I started down that route >> a >> > while back: the first step was to refactor the plugin registry so it is >> > ready for extensions. The idea was to use the same mechanism for all >> kinds >> > of extensions (security, UDFs, metastore, etc.) The next step was to >> build >> > something that roughly followed Presto, but that kind of stalled out. >> > >> > In terms of ordering, we'd first need to define the plugin API. Then, we >> > can shift plugins to use that. Once that is done, we can move plugins to >> > separate projects. (The metastore implementation can also move, if we >> > want.) Finally, figure out a solution for Ted's suggestion to make it >> easy >> > to grab new extensions. Drill is distributed, so adding a new plugin >> has to >> > happen on all nodes, which is a bit more complex than the typical >> > Julia/Python/R kind of extension. >> > >> > The reason we're where we're at is that it is the path of least >> resistance. >> > Creating a good extension mechanism is hard, but valuable, as Ted noted. >> > >> > Thanks, >> > >> > - Paul >> > >> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning >> wrote: >> > >> >> The bigger reason for a separate plug-in world is the enhancement of >> >>
Re: [DISCUSS] Drill 2 and plug-in organisation
I don't think that Maven is a forced move just because Drill is in Java. It may be a good move, but it isn't a foregone conclusion. For one thing, the conventions that Maven uses are pretty hard-wired and it may be difficult to have a reliable deny-list of known problematic plugins. Publishing to Maven is more of a pain than simply pushing to github. The usability here is paramount both for the ultimate Drill user, but also for the writer of plugins. On Mon, Jan 17, 2022 at 5:06 AM James Turton wrote: > Thank you Ted and Paul for the feedback. Since Java is compiled, Maven > is probably better fit than GitHub for distribution? If Drillbits can > write to their jars/3rdparty directory then I can imagine Drill gaining > the ability to fetch and install plugins itself without too much > trouble, at least for Drill clusters with Internet access. > "Sideloading" by downloading from Maven and copying manually would > always remain possible. > > @Paul I'll try to get a little time with you to get some ideas about > designing a plugin API. > > On 2022/01/14 23:20, Paul Rogers wrote: > > Hi All, > > > > James raises an important issue, I've noticed that it used to be easy to > > build and test Drill, now it is a struggle, because of the many odd > > external dependencies we have introduced. That acts as a big damper on > > contributions: none of us get paid enough to spend more time fighting > > builds than developing the code... > > > > Ted is right that we need a good way to install plugins. There are two > > parts. Ted is talking about the high-level part: make it easy to point to > > some repo and use the plugin. Since Drill is Java, the Maven repo could > be > > a good mechanism. In-house stuff is often in an internal repo that does > > whatever Maven needs. > > > > The reason that plugins are in the Drill project now is that Drill's > "API" > > is all of Drill. Plugins can (and some do) access all of Drill though the > > fragment context. 
The API to Calcite and other parts of Drill are wide, > and > > tend to be tightly coupled with Drill internals. By contrast, other > tools, > > such as Presto/Trino, have defined very clean APIs that extensions use. > In > > Druid, everything is integrated via Google Guice and an extension can > > replace any part of Druid (though, I'm not convinced that's actually a > good > > idea.) I'm sure there are others we can learn from. > > > > So, we need to define a plugin API for Drill. I started down that route a > > while back: the first step was to refactor the plugin registry so it is > > ready for extensions. The idea was to use the same mechanism for all > kinds > > of extensions (security, UDFs, metastore, etc.) The next step was to > build > > something that roughly followed Presto, but that kind of stalled out. > > > > In terms of ordering, we'd first need to define the plugin API. Then, we > > can shift plugins to use that. Once that is done, we can move plugins to > > separate projects. (The metastore implementation can also move, if we > > want.) Finally, figure out a solution for Ted's suggestion to make it > easy > > to grab new extensions. Drill is distributed, so adding a new plugin has > to > > happen on all nodes, which is a bit more complex than the typical > > Julia/Python/R kind of extension. > > > > The reason we're where we're at is that it is the path of least > resistance. > > Creating a good extension mechanism is hard, but valuable, as Ted noted. > > > > Thanks, > > > > - Paul > > > > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning > wrote: > > > >> The bigger reason for a separate plug-in world is the enhancement of > >> community. > >> > >> I would recommend looking at the Julia community for examples of > >> effective ways to drive plug in structure. > >> > >> At the core, for any pure julia package, you can simply add a package by > >> referring to the github repository where the package is stored. 
For > >> packages that are "registered" (i.e. a path and a checksum is recorded > in a > >> well known data store), you can add a package by simply naming it > without > >> knowing the path. All such plugins are tested by the authors and the > >> project records all dependencies with version constraints so that > cascading > >> additions are easy. The community leaders have made tooling available so > >> that you can test your package against a range of versions of Julia by > >> pretty simple (to use) Github actions. > >> > >> The result has been an absolute explosion in the number of pure Julia > >> packages. > >> > >> For packages that include C or Fortran (or whatever) code, there is some > >> amazing tooling available that lets you record a build process on any of > >> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD, > OSX > >> and so on). WHen you register such a package, it is automagically built > on > >> all the platforms you indicate and the binary results are checked into a > >> central repository known as Yggdrasil.
[GitHub] [drill] jnturton commented on pull request #2416: DRILL-8094: Support reverse truncation for split_part udf
jnturton commented on pull request #2416: URL: https://github.com/apache/drill/pull/2416#issuecomment-1014689249 P.S. Had Google implemented a lazy Splitter that operates backwards from the end of the string, there might have been a nice simplification.
[GitHub] [drill] jnturton merged pull request #2416: DRILL-8094: Support reverse truncation for split_part udf
jnturton merged pull request #2416: URL: https://github.com/apache/drill/pull/2416
[GitHub] [drill] jnturton commented on a change in pull request #2416: DRILL-8094: Support reverse truncation for split_part udf
jnturton commented on a change in pull request #2416: URL: https://github.com/apache/drill/pull/2416#discussion_r776967381

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java

@@ -416,16 +417,25 @@ public void setup() {

   @Override
   public void eval() {
-    if (index.value < 1) {
+    if (index.value == 0) {
       throw org.apache.drill.common.exceptions.UserException.functionError()
-          .message("Index in split_part must be positive, value provided was "
-              + index.value).build();
+          .message("Index in split_part can not be zero").build();
     }
     String inputString = org.apache.drill.exec.expr.fn.impl.
         StringFunctionHelpers.getStringFromVarCharHolder(in);
-    int arrayIndex = index.value - 1;
-    String result =
-        (String) com.google.common.collect.Iterables.get(splitter.split(inputString), arrayIndex, "");
+    String result = "";
+    if (index.value < 0) {
+      java.util.List<String> splits = splitter.splitToList(inputString);
+      int size = splits.size();
+      int arrayIndex = size + index.value;
+      if (arrayIndex >= 0) {

Review comment: For performance we try to avoid branching inside function `eval` methods whenever possible. Can any of these `if` statements be removed? E.g. by using modular arithmetic to calculate the wrapped array index.

Review comment (on the same hunk): Okay, that does actually appear to be consistent with what happens when arrayIndex is past the end of the array.

Review comment (on the same hunk): Perhaps in the context of this sort of string processing, branching penalties are not very significant 🤔
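As a standalone illustration of the semantics under discussion (a hypothetical sketch, not Drill's generated code; the class and method names are invented): a positive index counts parts from the left (1-based), a negative index counts from the right (-1 is the last part), zero is an error, and out-of-range indexes yield an empty string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitPartSketch {

  // Hypothetical model of split_part with negative-index support.
  // index > 0 counts parts from the left (1-based); index < 0 counts from
  // the right (-1 is the last part); 0 is rejected; out of range yields "".
  static String splitPart(String input, String delimiter, int index) {
    if (index == 0) {
      throw new IllegalArgumentException("Index in split_part can not be zero");
    }
    // The -1 limit keeps trailing empty parts in the split result.
    List<String> splits = Arrays.asList(input.split(Pattern.quote(delimiter), -1));
    int size = splits.size();
    // Map both sign conventions onto a 0-based index, as in the patch above.
    int arrayIndex = index > 0 ? index - 1 : size + index;
    return (arrayIndex >= 0 && arrayIndex < size) ? splits.get(arrayIndex) : "";
  }

  public static void main(String[] args) {
    System.out.println(splitPart("a,b,c", ",", 2));  // b
    System.out.println(splitPart("a,b,c", ",", -1)); // c
    System.out.println(splitPart("a,b,c", ",", 5));  // (empty string)
  }
}
```

Note that replacing the sign branch with pure modular arithmetic (e.g. `Math.floorMod`) would wrap out-of-range indexes back into the list instead of returning `""`, which is one reason the branches are hard to eliminate without changing behaviour.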
[GitHub] [drill] jnturton commented on pull request #2416: DRILL-8094: Support reverse truncation for split_part udf
jnturton commented on pull request #2416: URL: https://github.com/apache/drill/pull/2416#issuecomment-1014685897 Hi @Leon-WTF, sorry about the long delay here. I wanted to try to remove the `if` statements, but hadn't noticed that lazy splitting is possible for positive index values, making the cases more different than I'd realised. I came up with an alternative implementation for the negative index case: reverse the original string and the delimiter, do lazy splitting in the forward direction, then reverse the selected part to get the answer. In the end, though, I think it was worse than what's here. I'll approve shortly.
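The lazy-splitting point above can be illustrated with a small standalone helper (hypothetical names, not the Drill implementation): for a positive index, the scan can stop as soon as the wanted part is found, without materialising the full list of parts.

```java
public class LazySplitSketch {

  // Hypothetical helper: returns the 1-based n-th delimiter-separated part,
  // scanning only as far into the input as needed (no full split list).
  static String nthPart(String input, char delimiter, int n) {
    int start = 0;
    // Skip the first n-1 parts, stopping early if the input runs out.
    for (int i = 1; i < n; i++) {
      int next = input.indexOf(delimiter, start);
      if (next < 0) {
        return ""; // fewer than n parts
      }
      start = next + 1;
    }
    int end = input.indexOf(delimiter, start);
    return end < 0 ? input.substring(start) : input.substring(start, end);
  }

  public static void main(String[] args) {
    System.out.println(nthPart("a,b,c,d", ',', 2)); // b
    System.out.println(nthPart("a,b", ',', 5));     // (empty string)
  }
}
```

A negative index has no equivalent forward shortcut, since the total part count must be known first, which is what makes the two cases structurally different.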
[GitHub] [drill] jnturton commented on a change in pull request #2429: DRILL-8107: Hadoop2 backport Maven profile
jnturton commented on a change in pull request #2429: URL: https://github.com/apache/drill/pull/2429#discussion_r786005734

## File path: exec/jdbc-all/pom.xml

@@ -874,6 +874,42 @@

+    <profile>
+      <id>hadoop-2</id>
...
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-enforcer-plugin</artifactId>
+            <executions>
+              <execution>
+                <id>enforce-jdbc-jar-compactness</id>
+                <goals>
+                  <goal>enforce</goal>
+                </goals>
+                <phase>verify</phase>
...
+                    <message>
+                      The file drill-jdbc-all-${project.version}.jar is outside the expected size range.
+                      This is likely due to you adding new dependencies to java-exec and not updating the excludes in this module. This is important as it minimizes the size of the dependency of Drill application users.
+                    </message>
+                    <maxsize>4760

Review comment: Sounds like something that we should make a property anyway.
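Hoisting the size bound into a Maven property, as the comment suggests, might look roughly like this sketch (the property name `jdbc.allJar.maxsize` is illustrative and the value shown is a placeholder, not the real bound):

```xml
<!-- In the module's <properties> section; the name is illustrative. -->
<properties>
  <jdbc.allJar.maxsize>46000000</jdbc.allJar.maxsize>
</properties>

<!-- The enforcer rule then references the property instead of a literal. -->
<requireFilesSize>
  <message>The file drill-jdbc-all-${project.version}.jar is outside the expected size range.</message>
  <maxsize>${jdbc.allJar.maxsize}</maxsize>
  <files>
    <file>${project.build.directory}/drill-jdbc-all-${project.version}.jar</file>
  </files>
</requireFilesSize>
```

This way each profile (e.g. hadoop-2) can override the bound without duplicating the whole enforcer configuration.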
[GitHub] [drill] jnturton commented on pull request #2429: DRILL-8107: Hadoop2 backport Maven profile
jnturton commented on pull request #2429: URL: https://github.com/apache/drill/pull/2429#issuecomment-1014548764 @vdiravka am I right that our CI builds won't test the new hadoop-2 profile automatically?
[GitHub] [drill] jnturton edited a comment on pull request #2429: DRILL-8107: Hadoop2 backport Maven profile
jnturton edited a comment on pull request #2429: URL: https://github.com/apache/drill/pull/2429#issuecomment-1014342585 @cgivre I've just added instructions for using this profile to https://drill.apache.org/docs/compiling-drill-from-source/ . Well, I thought I had. Something's wrong with the website CI for now.
Re: [DISCUSS] Drill 2 and plug-in organisation
Thank you Ted and Paul for the feedback. Since Java is compiled, Maven is probably a better fit than GitHub for distribution? If Drillbits can write to their jars/3rdparty directory then I can imagine Drill gaining the ability to fetch and install plugins itself without too much trouble, at least for Drill clusters with Internet access. "Sideloading" by downloading from Maven and copying manually would always remain possible.

@Paul I'll try to get a little time with you to get some ideas about designing a plugin API.

On 2022/01/14 23:20, Paul Rogers wrote:

Hi All,

James raises an important issue. I've noticed that it used to be easy to build and test Drill; now it is a struggle because of the many odd external dependencies we have introduced. That acts as a big damper on contributions: none of us get paid enough to spend more time fighting builds than developing the code...

Ted is right that we need a good way to install plugins. There are two parts. Ted is talking about the high-level part: make it easy to point to some repo and use the plugin. Since Drill is Java, the Maven repo could be a good mechanism. In-house stuff is often in an internal repo that does whatever Maven needs.

The reason that plugins are in the Drill project now is that Drill's "API" is all of Drill. Plugins can (and some do) access all of Drill through the fragment context. The APIs to Calcite and other parts of Drill are wide, and tend to be tightly coupled with Drill internals. By contrast, other tools, such as Presto/Trino, have defined very clean APIs that extensions use. In Druid, everything is integrated via Google Guice and an extension can replace any part of Druid (though I'm not convinced that's actually a good idea). I'm sure there are others we can learn from.

So, we need to define a plugin API for Drill. I started down that route a while back: the first step was to refactor the plugin registry so it is ready for extensions.
The idea was to use the same mechanism for all kinds of extensions (security, UDFs, metastore, etc.). The next step was to build something that roughly followed Presto, but that kind of stalled out.

In terms of ordering, we'd first need to define the plugin API. Then we can shift plugins to use that. Once that is done, we can move plugins to separate projects. (The metastore implementation can also move, if we want.) Finally, figure out a solution for Ted's suggestion to make it easy to grab new extensions. Drill is distributed, so adding a new plugin has to happen on all nodes, which is a bit more complex than the typical Julia/Python/R kind of extension.

The reason we're where we are is that it is the path of least resistance. Creating a good extension mechanism is hard, but valuable, as Ted noted.

Thanks,
- Paul

On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning wrote:

The bigger reason for a separate plug-in world is the enhancement of community. I would recommend looking at the Julia community for examples of effective ways to drive plug-in structure.

At the core, for any pure Julia package, you can simply add a package by referring to the GitHub repository where the package is stored. For packages that are "registered" (i.e. a path and a checksum are recorded in a well-known data store), you can add a package by simply naming it without knowing the path. All such plugins are tested by their authors, and the project records all dependencies with version constraints so that cascading additions are easy. The community leaders have made tooling available so that you can test your package against a range of Julia versions via pretty simple (to use) GitHub Actions. The result has been an absolute explosion in the number of pure Julia packages.

For packages that include C or Fortran (or whatever) code, there is some amazing tooling available that lets you record a build process on any of the supported platforms (Linux, Linux on Arm, 32- or 64-bit, Windows, BSD, macOS and so on).
When you register such a package, it is automagically built on all the platforms you indicate and the binary results are checked into a central repository known as Yggdrasil. All of these registration events for different packages are recorded in a central registry, as I mentioned. That registry is kept in GitHub as well, which makes it easy to propagate changes.

On Thu, Jan 13, 2022 at 8:45 PM James Turton wrote:

Hello dev community

Discussions about reorganising the Drill source code to better position the project to support plug-ins for the "long tail" of weird and wonderful systems and data formats have been coming up here and there for a few months, e.g. in https://github.com/apache/drill/pull/2359. A view which I personally share is that adding too large a number and variety of plug-ins to the main tree would create a lethal maintenance burden for developers working there and lead down a road of accumulating technical debt. The Maven tricks we must employ to harmonise the growing set of
[GitHub] [drill-site] jnturton merged pull request #22: DRILL-8094: Update doc for split_part
jnturton merged pull request #22: URL: https://github.com/apache/drill-site/pull/22
[GitHub] [drill] jnturton commented on pull request #2429: DRILL-8107: Hadoop2 backport Maven profile
jnturton commented on pull request #2429: URL: https://github.com/apache/drill/pull/2429#issuecomment-1014342585 @cgivre I've just added instructions for using this profile to https://drill.apache.org/docs/compiling-drill-from-source/