Re: [DISCUSS] Some ideas for Drill 1.21
The planning time has been extensively analyzed. It is inherent in a Volcano-style cost-based optimizer, which performs a branch-and-bound search of an exponential design space. This bottleneck is very well understood. Further, it has been accelerated under specialized conditions: as part of OJAI, a limited form of Drill was included that could work on specific kinds of tables built into MapR FS. With some rather severe truncations of the space that the optimizer had to search, the planning time could be reduced to tens of milliseconds. That was fine for a limited mission, but some of the really dramatic benefits of Drill on large queries across complex domains would be impossible with that truncated rule set. On Wed, Feb 9, 2022 at 7:06 PM Paul Rogers wrote: > Hi All, > > Would be great to understand the source of the slow planning. Back in the > day, I recall colleagues trying all kinds of things to speed up planning, > but without the time to really figure out where the time went. > > I wonder if the two points are related. If most of that planning time is > spent waiting for plugin metadata, then James' & Charles' issue could > possibly be the cause of the slowness that Ted saw. > > James, it is still not clear what plugin metadata is being retrieved, and > when. Now, it is hard to figure that out; that code is complex. Ideally, if > you have a dozen plugins enabled, but query only one, then only that one > should be doing anything. Further, if you're using an external system (like > JDBC), the plugin should query the remote system tables only for the > table(s) you hit in your query. If the code asks ALL plugins for > information, or grabs all tables from the remote system, then, yeah, it's > going to be slow. > > Adding per-plugin caching might make sense. For JDBC, say, it is not likely > that the schema of the remote DB changes between queries, so caching for > some amount of time is probably fine.
And, if a query asks for an unknown > column, the plugin could refresh metadata to see if the column was just > added. (I was told that Impala users constantly had to run REFRESH METADATA > to pick up new files added to HDFS.) > > For the classic, original use case (Parquet or CSV files on an HDFS-like > system), the problem was the need to scan the directory structure at plan > time to figure out which files to scan at run time. For Parquet, the > planner also wants to do Parquet row group pruning, which requires reading > the header of every one of the target files. Since this was slow, Drill > would create a quick & dirty cache, but with large numbers of files, even > reading that cache was slow (and, Drill would rebuild it any time a > directory changed, which greatly slowed planning.) > > For that classic use case, saved plans never seemed a win because the > "shape" of the query heavily depended on the WHERE clause: one clause might > hit a small set of files, another hit a large set, and that then throws off > join planning, hash/broadcast exchange decisions and so on. > > So, back to the suggestion to start with understanding where the time goes. > Any silly stuff we can just stop doing? Is the cost due to external > factors, such as those cited above? Or, is Calcite itself just heavy > weight? Calcite is a rules engine. Add more rules or more nodes in the DAG, > and the cost of planning rises steeply. So, are we fiddling about too much > in the planning process? > > One way to test: use a mock data source and plan-time components to > eliminate all external factors. Time various query shapes using EXPLAIN. > How long does Calcite take? If a long time, then we've got a rather > difficult problem as Calcite is hard to fix/replace. > > Then, time the plugins of interest. Figure out how to optimize those. > > My guess is that the bottleneck won't turn out to be what we think it is. > It usually isn't. 
> > - Paul > > On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning wrote: > > > James, you make some good points. > > > > I would generally support what you say except for one special case. I > think > > that there is a case to be made to be able to cache query plans in some > > fashion. > > > > The traditional approach to do this is to use "prepared queries" by which > > the application signals that it is willing to trust that a query plan > will > > continue to be correct for the duration of its execution. My experience > > (and I think the industry's as well) is that the query plan is more > stable > > than the underlying details of the metadata and this level of caching (or > > more) is a very good idea. > > > > In particular, the benefit to Drill is that we have a very expensive > query > > planning phase (I have seen numbers in the range 200-800ms routinely) > but I > > have seen execution times that are as low as a few 10's of ms. This > > imbalance severely compromises the rate of concurrent querying for fast > > queries. Having some form of plan caching would allow this planning > >
[jira] [Resolved] (DRILL-8129) Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
[ https://issues.apache.org/jira/browse/DRILL-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Turton resolved DRILL-8129. - Resolution: Fixed > Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1 > -- > > Key: DRILL-8129 > URL: https://issues.apache.org/jira/browse/DRILL-8129 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.20.0 >Reporter: James Turton >Assignee: James Turton >Priority: Blocker > Fix For: 1.20.0 > > > Because this dependency is of type "bundle", the module requires the > maven-bundle-plugin in order to resolve it, and for the module to build. > > {code:java} > [ERROR] Failed to execute goal on project drill-storage-phoenix: Could not > resolve dependencies for project > org.apache.drill.contrib:drill-storage-phoenix:jar:1.20.0-SNAPSHOT: Failure > to find org.apache.directory.jdbm:apacheds-jdbm1:bundle:2.0.0-M2 in > https://conjars.org/repo was cached in the local repository, resolution will > not be reattempted until the update interval of conjars has elapsed or > updates are forced -{code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [drill] jnturton merged pull request #2457: DRILL-8129: Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
jnturton merged pull request #2457: URL: https://github.com/apache/drill/pull/2457 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [DISCUSS] Some ideas for Drill 1.21
Hi All, Would be great to understand the source of the slow planning. Back in the day, I recall colleagues trying all kinds of things to speed up planning, but without the time to really figure out where the time went. I wonder if the two points are related. If most of that planning time is spent waiting for plugin metadata, then James' & Charles' issue could possibly be the cause of the slowness that Ted saw. James, it is still not clear what plugin metadata is being retrieved, and when. Now, it is hard to figure that out; that code is complex. Ideally, if you have a dozen plugins enabled, but query only one, then only that one should be doing anything. Further, if you're using an external system (like JDBC), the plugin should query the remote system tables only for the table(s) you hit in your query. If the code asks ALL plugins for information, or grabs all tables from the remote system, then, yeah, it's going to be slow. Adding per-plugin caching might make sense. For JDBC, say, it is not likely that the schema of the remote DB changes between queries, so caching for some amount of time is probably fine. And, if a query asks for an unknown column, the plugin could refresh metadata to see if the column was just added. (I was told that Impala users constantly had to run REFRESH METADATA to pick up new files added to HDFS.) For the classic, original use case (Parquet or CSV files on an HDFS-like system), the problem was the need to scan the directory structure at plan time to figure out which files to scan at run time. For Parquet, the planner also wants to do Parquet row group pruning, which requires reading the header of every one of the target files. Since this was slow, Drill would create a quick & dirty cache, but with large numbers of files, even reading that cache was slow (and, Drill would rebuild it any time a directory changed, which greatly slowed planning.)
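The per-plugin caching idea above could look something like the following sketch. This is a hypothetical illustration, not Drill's actual plugin API: a TTL-bounded schema cache per table, with a forced-refresh path for the unknown-column case Paul describes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch (not Drill's real API): cache remote schemas for a
// while so repeated queries against a JDBC-like plugin skip the round trip.
class SchemaCache {
  private static final class Entry {
    final Object schema;
    final long fetchedAt;
    Entry(Object schema, long fetchedAt) {
      this.schema = schema;
      this.fetchedAt = fetchedAt;
    }
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();
  private final long ttlMillis;

  SchemaCache(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  // Returns the cached schema for a table, consulting the remote system only
  // when the entry is missing, expired, or a refresh is forced (e.g. the
  // query referenced a column the cached schema does not know about).
  Object get(String table, boolean forceRefresh, Function<String, Object> fetchRemote) {
    long now = System.currentTimeMillis();
    Entry e = entries.get(table);
    if (e == null || forceRefresh || now - e.fetchedAt > ttlMillis) {
      Object schema = fetchRemote.apply(table);
      entries.put(table, new Entry(schema, now));
      return schema;
    }
    return e.schema;
  }
}
```

Under this kind of scheme, only the first query against a table (per TTL window) pays the metadata cost; a query that hits an unknown column would call `get(table, true, ...)` to refresh.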
For that classic use case, saved plans never seemed a win because the "shape" of the query heavily depended on the WHERE clause: one clause might hit a small set of files, another hit a large set, and that then throws off join planning, hash/broadcast exchange decisions and so on. So, back to the suggestion to start with understanding where the time goes. Any silly stuff we can just stop doing? Is the cost due to external factors, such as those cited above? Or, is Calcite itself just heavy weight? Calcite is a rules engine. Add more rules or more nodes in the DAG, and the cost of planning rises steeply. So, are we fiddling about too much in the planning process? One way to test: use a mock data source and plan-time components to eliminate all external factors. Time various query shapes using EXPLAIN. How long does Calcite take? If a long time, then we've got a rather difficult problem as Calcite is hard to fix/replace. Then, time the plugins of interest. Figure out how to optimize those. My guess is that the bottleneck won't turn out to be what we think it is. It usually isn't. - Paul On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning wrote: > James, you make some good points. > > I would generally support what you say except for one special case. I think > that there is a case to be made to be able to cache query plans in some > fashion. > > The traditional approach to do this is to use "prepared queries" by which > the application signals that it is willing to trust that a query plan will > continue to be correct for the duration of its execution. My experience > (and I think the industry's as well) is that the query plan is more stable > than the underlying details of the metadata and this level of caching (or > more) is a very good idea. > > In particular, the benefit to Drill is that we have a very expensive query > planning phase (I have seen numbers in the range 200-800ms routinely) but I > have seen execution times that are as low as a few 10's of ms. 
This > imbalance severely compromises the rate of concurrent querying for fast > queries. Having some form of plan caching would allow this planning > overhead to drop to zero in select cases. > > I have been unable to even consider working on this problem, but it seems > that one interesting heuristic would be based on two factors: > - the ratio of execution time to planning time > The rationale is that if a query takes much longer to run than to plan, we > might as well do planning each time. Conversely, if a query takes much less > time to run than it takes to plan, it is very important to avoid that > planning time. > > - the degree to which recent execution times seem inconsistent with longer > history > The rationale here is that a persistent drop in performance for a query is > a strong indicator that any cached plan is no longer valid and should be > updated. Conversely, if recent query history is consistent with long-term > history, that is a vote of confidence for the plan. Furthermore, depending > on how this is implemented the chance of a false positive change detec
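Ted's two-factor heuristic could be sketched as below. The class name and the thresholds are assumptions chosen for illustration only; nothing this concrete was proposed in the thread.

```java
// Hypothetical sketch of the two-factor plan-caching heuristic described
// above. The 2.0 staleness threshold is an assumed value, not measured.
class PlanCachePolicy {

  // Factor 1: cache a plan only when planning costs more than execution,
  // since that is where skipping the planner actually pays off (e.g. a
  // 500 ms plan for a 20 ms query).
  static boolean worthCaching(double planMillis, double execMillis) {
    return planMillis > execMillis;
  }

  // Factor 2: treat a persistent slowdown relative to long-term history as
  // evidence that the cached plan has gone stale and should be replanned;
  // recent times consistent with history are a vote of confidence.
  static boolean looksStale(double recentMeanMillis, double historicalMeanMillis) {
    return recentMeanMillis > 2.0 * historicalMeanMillis;
  }
}
```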
[GitHub] [drill] vvysotskyi commented on a change in pull request #2457: DRILL-8129: Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
vvysotskyi commented on a change in pull request #2457: URL: https://github.com/apache/drill/pull/2457#discussion_r803020109 ## File path: contrib/storage-phoenix/pom.xml ## @@ -326,6 +330,12 @@ -Xms2048m -Xmx2048m + +org.apache.felix Review comment: Ok, thanks for the explanation. Yes, looks like `MiniKdc` depends on this library so it cannot be excluded. Could you please add this plugin under the hadoop-2 profile?
[GitHub] [drill] jnturton commented on a change in pull request #2457: DRILL-8129: Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
jnturton commented on a change in pull request #2457: URL: https://github.com/apache/drill/pull/2457#discussion_r802998151 ## File path: contrib/storage-phoenix/pom.xml ## @@ -326,6 +330,12 @@ -Xms2048m -Xmx2048m + +org.apache.felix Review comment: @vvysotskyi Without it I cannot build storage-phoenix using `-Phadoop-2`: ``` [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 2.275 s (Wall Clock) [INFO] Finished at: 2022-02-09T19:44:28+02:00 [INFO] [ERROR] Failed to execute goal on project drill-storage-phoenix: Could not resolve dependencies for project org.apache.drill.contrib:drill-storage-phoenix:jar:1.20.0-SNAPSHOT: Failure to find org.apache.directory.jdbm:apacheds-jdbm1:bundle:2.0.0-M2 in https://conjars.org/repo was cached in the local repository, resolution will not be reattempted until the update interval of conjars has elapsed or updates are forced -> [Help 1] ```
[GitHub] [drill] vvysotskyi commented on a change in pull request #2457: DRILL-8129: Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
vvysotskyi commented on a change in pull request #2457: URL: https://github.com/apache/drill/pull/2457#discussion_r802978299 ## File path: contrib/storage-phoenix/pom.xml ## @@ -326,6 +330,12 @@ -Xms2048m -Xmx2048m + +org.apache.felix Review comment: Could you please clarify what is the reason for adding this plugin?
[GitHub] [drill] jnturton opened a new pull request #2457: DRILL-8129: Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
jnturton opened a new pull request #2457: URL: https://github.com/apache/drill/pull/2457 # [DRILL-8129](https://issues.apache.org/jira/browse/DRILL-8129): Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1 ## Description Because this dependency is of type "bundle", the module requires the maven-bundle-plugin in order to resolve it, and for the module to build. ## Documentation N/A ## Testing Build Drill under the default profile and under -Phadoop-2
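For context, the fix described here amounts to declaring the Felix maven-bundle-plugin in the module's pom.xml so that Maven can resolve dependencies whose type is "bundle". A sketch of the kind of declaration involved (the exact placement in the actual PR may differ, e.g. under the hadoop-2 profile, and the version is managed elsewhere):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <!-- extensions=true registers the plugin's lifecycle extensions so
           Maven understands the "bundle" dependency/packaging type -->
      <extensions>true</extensions>
    </plugin>
  </plugins>
</build>
```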
[jira] [Created] (DRILL-8129) Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1
James Turton created DRILL-8129: --- Summary: Storage-phoenix cannot resolve OSGi bundle apache-ds.jdbm1 Key: DRILL-8129 URL: https://issues.apache.org/jira/browse/DRILL-8129 Project: Apache Drill Issue Type: Bug Affects Versions: 1.20.0 Reporter: James Turton Assignee: James Turton Fix For: 1.20.0 Because this dependency is of type "bundle", the module requires the maven-bundle-plugin in order to resolve it, and for the module to build. {code:java} [ERROR] Failed to execute goal on project drill-storage-phoenix: Could not resolve dependencies for project org.apache.drill.contrib:drill-storage-phoenix:jar:1.20.0-SNAPSHOT: Failure to find org.apache.directory.jdbm:apacheds-jdbm1:bundle:2.0.0-M2 in https://conjars.org/repo was cached in the local repository, resolution will not be reattempted until the update interval of conjars has elapsed or updates are forced -{code}
Re: [VOTE] Release Apache Drill 1.20.0 - RC1
Thanks everyone for testing. It turns out I've broken a couple of points of Git and Maven release protocol, in large part from my efforts to release two builds. An RC2 is now being prepared. On 2022/02/09 18:53, Vova Vysotskyi wrote: Hi James! Thanks for doing the RC so rapidly! I was verifying the previous RC, and after announcing this one switched to it. The release should be based on the commit generated by Maven ([maven-release-plugin] prepare release drill-XXX), and for the previous release candidate that was so, but for this one, the commit id refers to the previous commit: (DRILL-8126: Ignore OAuth Parameter in Storage Plugin). Running the select * from sys.version; query in Drill also returns that commit, but in this case, it is strange that the version was 1.20.0 since those changes weren't committed in the branch whose head is 73a829a5a0eb21fc35d6cfd878310b7069135ecd... Kind regards, Volodymyr Vysotskyi On 2022/02/08 14:54:36 James Turton wrote: Hi all Note from the release manager. I'll undertake to add a Hadoop 2 release candidate shortly. I have checked and the issue found in RC0 (DRILL-8126) is fixed. That is the only change between this RC and the previous one. - Thanks, James I'd like to propose the second release candidate (RC1) of Apache Drill, version 1.20.0. The release candidate covers a total of 106 resolved JIRAs [1]. Thanks to everyone who contributed to this release. The tarball artifacts are hosted at [2] and the maven artifacts are hosted at [3]. This release candidate is based on commit 73a829a5a0eb21fc35d6cfd878310b7069135ecd located at [4]. Please download and try out the release. [ ] +1 [ ] +0 [ ] -1 [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350301&projectId=12313820 [2] https://dist.apache.org/repos/dist/dev/drill/drill-1.20.0-rc1/ [3] https://repository.apache.org/content/repositories/orgapachedrill-1088/ [4] https://github.com/jnturton/drill/commits/drill-1.20.0
Re: [VOTE] Release Apache Drill 1.20.0 - RC1
Hi James! Thanks for doing the RC so rapidly! I was verifying the previous RC, and after announcing this one switched to it. The release should be based on the commit generated by Maven ([maven-release-plugin] prepare release drill-XXX), and for the previous release candidate that was so, but for this one, the commit id refers to the previous commit: (DRILL-8126: Ignore OAuth Parameter in Storage Plugin). Running the select * from sys.version; query in Drill also returns that commit, but in this case, it is strange that the version was 1.20.0 since those changes weren't committed in the branch whose head is 73a829a5a0eb21fc35d6cfd878310b7069135ecd... Kind regards, Volodymyr Vysotskyi On 2022/02/08 14:54:36 James Turton wrote: > Hi all > > Note from the release manager. > > I'll undertake to add a Hadoop 2 release candidate shortly. I have > checked and the issue found in RC0 (DRILL-8126) is fixed. That is the > only change between this RC and the previous one. > > - Thanks, James > > I'd like to propose the second release candidate (RC1) of Apache Drill, > version 1.20.0. > > The release candidate covers a total of 106 resolved JIRAs [1]. Thanks > to everyone who contributed to this release. > > The tarball artifacts are hosted at [2] and the maven artifacts are > hosted at [3]. > > This release candidate is based on commit > 73a829a5a0eb21fc35d6cfd878310b7069135ecd located at [4]. > > Please download and try out the release. > > [ ] +1 > [ ] +0 > [ ] -1 > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350301&projectId=12313820 > [2] https://dist.apache.org/repos/dist/dev/drill/drill-1.20.0-rc1/ > [3] https://repository.apache.org/content/repositories/orgapachedrill-1088/ > [4] https://github.com/jnturton/drill/commits/drill-1.20.0 >
Re: [VOTE] Release Apache Drill 1.20.0 - RC1
The storage-phoenix plugin put up an unexpected final fight when it was asked to build under the -Phadoop-2 profile, but I think that all is now okay with the Hadoop 2 build. Tarballs https://dist.apache.org/repos/dist/dev/drill/drill-1.20.0-hadoop-2-rc1/ Git tag https://github.com/jnturton/drill/commits/drill-1.20.0-hadoop-2 Maven https://repository.apache.org/content/repositories/orgapachedrill-1089/ Please test this build too, especially if you have a Hadoop 2 environment handy. So far I have only been able to produce this additional build in the form of an entirely new release with version 1.20.0-hadoop-2. I did not see a way to avoid a new release given the tools we use today, but if there are maven-release-plugin secrets that I need to be taught, please don't hesitate to do that. A downside of the -hadoop-2 version number I've generated is that I believe that 1.20.0-hadoop-2 > 1.20.0 in Maven's eyes, an inequality which could possibly do something weird to someone out there without pinned dependency versions. An alternative to drill-1.20.0-hadoop-2 (version number modified) that was considered was drill-hadoop-2-1.20.0 (package name modified); we can discuss that if you'd like. James On 2022/02/08 16:54, James Turton wrote: Hi all Note from the release manager. I'll undertake to add a Hadoop 2 release candidate shortly. I have checked and the issue found in RC0 (DRILL-8126) is fixed. That is the only change between this RC and the previous one. - Thanks, James I'd like to propose the second release candidate (RC1) of Apache Drill, version 1.20.0. The release candidate covers a total of 106 resolved JIRAs [1]. Thanks to everyone who contributed to this release. The tarball artifacts are hosted at [2] and the maven artifacts are hosted at [3]. This release candidate is based on commit 73a829a5a0eb21fc35d6cfd878310b7069135ecd located at [4]. Please download and try out the release. 
[ ] +1 [ ] +0 [ ] -1 [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350301&projectId=12313820 [2] https://dist.apache.org/repos/dist/dev/drill/drill-1.20.0-rc1/ [3] https://repository.apache.org/content/repositories/orgapachedrill-1088/ [4] https://github.com/jnturton/drill/commits/drill-1.20.0
[GitHub] [drill] rymarm opened a new pull request #2456: DRILL-8122: Change kafka metadata obtaining due to KAFKA-5697
rymarm opened a new pull request #2456: URL: https://github.com/apache/drill/pull/2456 # [DRILL-8122](https://issues.apache.org/jira/browse/DRILL-8122): Change kafka metadata obtaining due to KAFKA-5697 ## Description [`Consumer#poll(long)`](https://javadoc.io/static/org.apache.kafka/kafka-clients/3.1.0/org/apache/kafka/clients/consumer/Consumer.html#poll-long-) is deprecated starting from Kafka 2.0. In Drill, `Consumer#poll` is used in 2 places: 1. [For its direct purpose](https://github.com/apache/drill/blob/15b2f52260e4f0026f2dfafa23c5d32e0fb66502/contrib/storage-kafka/src/main/java/org/apache/drill/exec/store/kafka/MessageIterator.java#L82) 2. As the only way to make a Kafka consumer [update metadata](https://github.com/apache/drill/blob/15b2f52260e4f0026f2dfafa23c5d32e0fb66502/contrib/storage-kafka/src/main/java/org/apache/drill/exec/store/kafka/KafkaGroupScan.java#L185) Kafka [hasn't implemented](https://cwiki.apache.org/confluence/display/KAFKA/KIP-505%3A+Add+new+public+method+to+only+update+assignment+metadata+in+consumer) a separate method to update metadata, and the new implementation [Consumer#poll(Duration)](https://javadoc.io/static/org.apache.kafka/kafka-clients/3.1.0/org/apache/kafka/clients/consumer/Consumer.html#poll-java.time.Duration-) doesn't work with the hack that Drill uses, `poll(0)`, due to changed logic: https://github.com/apache/kafka/pull/4855 . That is why I had to use a loop with a timeout to work around the absent separate method. ## Documentation \- ## Testing Unit tests
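The loop-with-timeout workaround mentioned in the description could be sketched as follows. This is a hypothetical helper, not the PR's actual code; the Supplier stands in for a `consumer.poll(Duration)` call followed by a check of something like `consumer.assignment()`.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: since poll(Duration.ZERO) no longer blocks until
// metadata arrives the way poll(0) effectively did, poll repeatedly until
// metadata shows up or a deadline passes.
class MetadataWait {
  static <T> T awaitMetadata(Supplier<Optional<T>> pollOnce, Duration timeout) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (System.nanoTime() < deadline) {
      // Each attempt stands in for consumer.poll(Duration.ofMillis(...))
      // followed by inspecting the consumer's assignment/metadata.
      Optional<T> metadata = pollOnce.get();
      if (metadata.isPresent()) {
        return metadata.get();
      }
    }
    throw new IllegalStateException("Timed out waiting for Kafka metadata");
  }
}
```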