[jira] [Created] (DRILL-5353) Merge "Project on Project" generated in physical plan stage
Chunhui Shi created DRILL-5353: -- Summary: Merge "Project on Project" generated in physical plan stage Key: DRILL-5353 URL: https://issues.apache.org/jira/browse/DRILL-5353 Project: Apache Drill Issue Type: Bug Reporter: Chunhui Shi Assignee: Chunhui Shi There is a possibility that at the physical plan stage we will get a project-on-project plan. But the ProjectMergeRule (DrillMergeProjectRule) is only for logical planning. We need to apply the rule in the physical plan stage as well. And even after the planning stage, the JoinPrelRenameVisitor could also inject an extra Project which can be merged with the Project underneath (if there is one). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
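To illustrate why adjacent Projects can be collapsed, here is a minimal sketch in plain Java (not Calcite's RelNode/RexNode API; class and method names are mine): merging substitutes the lower project's source columns into the upper project's references, yielding a single equivalent Project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a Project-on-Project merge. Each projection is
// modeled as a map from output column name to the input column it reads
// (pass-through projections only, to keep the sketch minimal).
public class ProjectMergeSketch {

  static Map<String, String> merge(Map<String, String> upper,
                                   Map<String, String> lower) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : upper.entrySet()) {
      // The upper project reads column e.getValue() from the lower
      // project; replace it with the lower project's own source column.
      merged.put(e.getKey(), lower.get(e.getValue()));
    }
    return merged;
  }
}
```

With upper = {a -> x} and lower = {x -> col1}, the merged projection is {a -> col1}, so one of the two Project operators can be dropped.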
[GitHub] drill pull request #783: DRILL-5324: Provide simplified column reader/writer...
GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/783 DRILL-5324: Provide simplified column reader/writer for use in tests The new "sub-operator" unit test framework provides simple ways to create row sets in code. This PR includes the column accessor code: * Interfaces for column accessors * Template for generated implementations * Base implementation used by the generated code * Factory class to create the proper reader or writer given a major type (type and cardinality) * Utilities for generic access, type conversions, etc. Many vector types can be mapped to an int for get and set. One key exception is the decimal types: decimals, by definition, require a different representation. In Java, that is `BigDecimal`. Added get, set and setSafe accessors as required for each decimal type that uses `BigDecimal` to hold data. Work remains to be done on other complex types: intervals and so on. This will be added incrementally as work proceeds. The generated code builds on the `valueVectorTypes.tdd` file, adding additional properties needed to generate the accessors. The PR also includes a number of code cleanups done while reviewing existing code. In particular `DecimalUtility` was very roughly formatted and thus hard to follow.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5324 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/783.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #783 commit eb0b8bc33aeea27fd0aae582d19297bd0bda92e1 Author: Paul Rogers Date: 2017-03-11T07:03:23Z The PR includes the column accessor code: * Interfaces described above * Generated implementations * Base implementation used by the generated code * Factory class to create the proper reader or writer given a major type (type and cardinality) * Utilities for generic access, type conversions, etc. Many vector types can be mapped to an int for get and set. One key exception are the decimal types: decimals, by definition, require a different representation. In Java, that is `BigDecimal`. Added get, set and setSafe accessors as required for each decimal type that uses `BigDecimal` to hold data. Work remains to be done on other complex types: intervals and so on. This will be added incrementally as work proceeds. The generated code builds on the `valueVectorTypes.tdd` file, adding additional properties needed to generate the accessors. The PR also includes a number of code cleanups done while reviewing existing code. In particular `DecimalUtility` was very roughly formatted and thus hard to follow. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
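The accessor idea described above can be sketched as follows. This is only an illustration of the shape of the design, not the actual DRILL-5324 interfaces; all names here are assumptions, and a toy list stands in for a value vector:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: most vector types map get/set to an int, while
// decimal types need BigDecimal as their Java representation.
public class AccessorSketch {

  interface ColumnWriter {
    void setInt(int value);
    void setDecimal(BigDecimal value);
  }

  interface ColumnReader {
    int getInt(int index);
    BigDecimal getDecimal(int index);
  }

  // A toy column backed by a list; a real accessor wraps a value vector.
  static class DecimalColumn implements ColumnWriter, ColumnReader {
    private final List<BigDecimal> values = new ArrayList<>();

    @Override public void setInt(int value) { values.add(BigDecimal.valueOf(value)); }
    @Override public void setDecimal(BigDecimal value) { values.add(value); }
    @Override public int getInt(int index) { return values.get(index).intValueExact(); }
    @Override public BigDecimal getDecimal(int index) { return values.get(index); }
  }
}
```

A factory as described in the PR would pick the concrete reader/writer implementation from a column's major type (type plus cardinality), so test code only ever sees the two small interfaces.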
[GitHub] drill pull request #782: DRILL-5352: Profile parser printing for multi fragm...
GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/782 DRILL-5352: Profile parser printing for multi fragments Enhances the recently added ProfileParser to display run times for queries that contain multiple fragments. (The original version handled just a single fragment.) Prints the query in "classic" mode if it is linear, or in the new semi-indented mode if the query forms a tree. Also cleans up formatting - removing spaces between parens. You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5352 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/782.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #782 commit 6f07584ac0bf0778d1164ab0b169dfba27957a1d Author: Paul Rogers Date: 2017-03-14T03:43:25Z DRILL-5352: Profile parser printing for multi fragments Enhances the recently added ProfileParser to display run times for queries that contain multiple fragments. (The original version handled just a single fragment.) Prints the query in "classic" mode if it is linear, or in the new semi-indented mode if the query forms a tree. Also cleans up formatting - removing spaces between parens.
[jira] [Created] (DRILL-5352) Extend test framework profile parser printer for multi-fragment queries
Paul Rogers created DRILL-5352: -- Summary: Extend test framework profile parser printer for multi-fragment queries Key: DRILL-5352 URL: https://issues.apache.org/jira/browse/DRILL-5352 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.10.0 Reporter: Paul Rogers Assignee: Paul Rogers Priority: Minor Fix For: 1.11.0 The recently added test framework has a tool called the {{ProfileParser}} which started as a tool for analyzing run times of single-fragment queries. Over time, it evolved to compare planned and actual cost for multi-fragment queries. This ticket requests that multi-fragment support be added to the printing of run times. If a query is single-threaded, print the query as in the prior version: {code} Op: 0 Screen Setup: 0 - 0%, 0% Process: 35 - 0%, 0% Wait:16 Memory: 10 Op: 1 Project Setup: 22 - 1%, 0% Process: 41 - 0%, 0% Memory: 5 ... {code} If the query is multi-fragment and forms a tree, use the format used to display planning vs. actual info: {code} 03-09 . . Project Setup:0 ms - 0%, 0% Process: 0 ms - 0%, 0% 03-10 . . HashJoin (HASH JOIN) Setup:0 ms - 0%, 0% Process: 5,097,619 ms - 326770%, 73% 03-12 . . . . Project Setup: 36 ms - 2%, 0% Process:180 ms - 11%, 0% {code}
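The indented tree format shown above ("03-09 . . Project") boils down to a depth-based prefix in front of each operator. A minimal sketch of that formatting idea (hypothetical helper, not ProfileParser's actual code):

```java
// Hypothetical sketch: print an operator line in the tree format used
// above, with ". " repeated once per level of tree depth.
public class TreePrintSketch {
  static String line(String opId, String opName, int depth) {
    StringBuilder sb = new StringBuilder(opId).append(' ');
    for (int i = 0; i < depth; i++) {
      sb.append(". ");  // one dot-space pair per ancestor in the tree
    }
    return sb.append(opName).toString();
  }
}
```

For a linear (single-fragment) plan the depth is constant, which is why the "classic" flat listing suffices there.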
[GitHub] drill pull request #777: DRILL-5330: NPE in FunctionImplementationRegistry
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/777#discussion_r105792412 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java --- @@ -160,7 +168,7 @@ public DrillFuncHolder findDrillFunction(FunctionResolver functionResolver, Func FunctionResolver exactResolver = FunctionResolverFactory.getExactResolver(functionCall); DrillFuncHolder holder = exactResolver.getBestMatch(functions, functionCall); -if (holder == null) { +if (holder == null && useDynamicUdfs) { --- End diff -- Ah, now I see what's happening (I hope...) I pushed another commit that makes the suggested changes. I wonder, do we have any unit tests for the ambiguous-function case? The unit tests passed with both the original and this new version, so I wonder if we have a hole in our test coverage?
Re: Drill date & time types encoding
Thanks Parth! The date and time definitions are the “classic” ones, but conflict with the Drill documentation: http://drill.apache.org/docs/supported-data-types/ DATE Years, months, and days in YYYY-MM-DD format since 4713 BC TIME 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss Which is correct? If the documentation is wrong, we can file a JIRA to correct it. (It may not even be wrong, since one can convert from one to the other easily, it may just be misleading…) Also note that, according to C++, DATE and TIME and TIMESTAMP are exactly the same, but TIME, as a 32-bit number, could only hold about 2 years due to limited range. Also, according to SQL, DATE has no time zone, it is just a date. That is, 2016-03-13 is the same date in PST or GMT. If DATE were seconds since the UTC epoch, dates would be different in different time zones. So, I assume we use the Unix epoch, but without an implied UTC time zone as is usual for Linux and Windows timestamps? How does a TIMESTAMP differ from a DATE? Perhaps a TIMESTAMP is based on the epoch UTC while DATE has no implied time zone? Again, the documentation differs: INTERVAL (Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR.) A day-time or year-month interval TIMESTAMP JDBC timestamp in year, month, date, hour, minute, second, and optional milliseconds format: YYYY-MM-dd HH:mm:ss.SSS So, sounds like we have an INTERVALDAY and an INTERVALYEAR, but do we or do we not have an INTERVAL? If anyone knows, please let me know, else I need to do some poking around...
Thanks, - Paul On Mar 13, 2017, at 2:44 PM, Parth Chandra <par...@apache.org> wrote: Paul asked this and I'm posting here so someone who knows better can correct me if I'm wrong ( This is from my notes when I was young) DATE : Int64 : Milliseconds from Unix Epoch : 1/1/1970 00:00:00 TIME : Int32 : Milliseconds from midnight on 1/1/1970 TimeStampTZ : Int64 + Int32 : (Milliseconds from epoch + Index into list of TimeZones) TimeStamp : Int64 : Milliseconds from epoch Interval : Int32 + Int32 + Int32 : Month + Days + Milliseconds Interval Day : Int32 + Int32 : Days + Milliseconds Interval Year : Int32 : Month A slightly readable version of these can be found in the C++ client :). $drill_src/contrib/native/client/src/include/drill/recordbatch.hpp which has a bunch of 'Holder' structs for the date-time types. HTH Parth
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
GitHub user parthchandra opened a pull request: https://github.com/apache/drill/pull/781 DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader Two changes in var len vectors: 1) Instead of checking to see if we need to realloc for every setSafe call, let the write fail and catch the exception. The exception, though expensive, will happen very rarely. 2) Call fillEmpties only if there are empty values to fill. This saves a bunch of CPU on every setSafe call. You can merge this pull request into a Git repository by running: $ git pull https://github.com/parthchandra/drill DRILL-5351 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/781.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #781 commit 57869496526a43351575d0f4879d2ac28fe973d4 Author: Parth Chandra Date: 2017-02-11T01:40:25Z DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader
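Change (1) above is the classic "write optimistically, recover on overflow" pattern. A minimal sketch of the idea (assumed names, a plain byte array standing in for a DrillBuf; not the actual vector code):

```java
import java.util.Arrays;

// Sketch of optimistic writes: skip the explicit capacity check on every
// set call, rely on the JVM's own bounds check, and reallocate only on
// the rare overflow. The exception path is expensive but infrequent.
public class OptimisticWriteSketch {
  private byte[] buf = new byte[8];

  void setSafe(int index, byte value) {
    try {
      buf[index] = value;            // common case: no extra check
    } catch (ArrayIndexOutOfBoundsException e) {
      // rare case: grow the buffer and retry the write
      buf = Arrays.copyOf(buf, Math.max(buf.length * 2, index + 1));
      buf[index] = value;
    }
  }

  byte get(int index) { return buf[index]; }
}
```

The trade-off is that the fast path does no branching at all, while the slow path pays for exception construction plus a copy, which is acceptable only because overflows occur once per buffer growth rather than once per value.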
Drill date & time types encoding
Paul asked this and I'm posting here so someone who knows better can correct me if I'm wrong ( This is from my notes when I was young) DATE : Int64 : Milliseconds from Unix Epoch : 1/1/1970 00:00:00 TIME : Int32 : Milliseconds from midnight on 1/1/1970 TimeStampTZ : Int64 + Int32 : (Milliseconds from epoch + Index into list of TimeZones) TimeStamp : Int64 : Milliseconds from epoch Interval : Int32 + Int32 + Int32 : Month + Days + Milliseconds Interval Day : Int32 + Int32 : Days + Milliseconds Interval Year : Int32 : Month A slightly readable version of these can be found in the C++ client :). $drill_src/contrib/native/client/src/include/drill/recordbatch.hpp which has a bunch of 'Holder' structs for the date-time types. HTH Parth
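The two simplest encodings Parth lists above (DATE as Int64 milliseconds from the Unix epoch, TIME as Int32 milliseconds from midnight) can be sketched with `java.time`; the helper names are mine, not Drill API:

```java
import java.time.LocalDate;
import java.time.LocalTime;

// Sketch of the encodings listed above: DATE = int64 ms from the Unix
// epoch (no time zone implied), TIME = int32 ms from midnight.
public class DateTimeEncodingSketch {

  static long encodeDate(LocalDate date) {
    // Days since 1970-01-01, scaled to milliseconds.
    return date.toEpochDay() * 86_400_000L;
  }

  static int encodeTime(LocalTime time) {
    // Milliseconds since midnight; fits easily in an int32
    // (a full day is 86,400,000 ms).
    return (int) (time.toNanoOfDay() / 1_000_000L);
  }
}
```

Note that because a day is only 86,400,000 ms, an Int32 TIME comfortably covers a 24-hour clock, while DATE values are always whole-day multiples of 86,400,000 in this scheme.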
[RESULT] [VOTE] Release Apache Drill 1.10.0 rc0
The vote passes. Thanks to everyone who has tested the release candidate and given their comments and votes. Final tally: 3x +1 (binding): Aman, Parth, Jinfeng 2x +1 (non-binding): Arina, Gautam No 0s or -1s. I'll push the release artifacts and send an announcement once propagated. Thanks, Jinfeng
[jira] [Created] (DRILL-5351) Excessive bounds checking in the Parquet reader
Parth Chandra created DRILL-5351: Summary: Excessive bounds checking in the Parquet reader Key: DRILL-5351 URL: https://issues.apache.org/jira/browse/DRILL-5351 Project: Apache Drill Issue Type: Improvement Reporter: Parth Chandra In profiling the Parquet reader, the variable length decoding appears to be a major bottleneck making the reader CPU bound rather than disk bound. A YourKit profile indicates the following methods being severe bottlenecks - VarLenBinaryReader.determineSizeSerial(long) NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf) DrillBuf.chk(int, int) NullableVarBinaryVector$Mutator.fillEmpties() The problem is that each of these methods does some form of bounds checking and eventually of course, the actual write to the ByteBuf is also bounds checked. DrillBuf.chk can be disabled by a configuration setting. Disabling this does improve performance of TPCH queries. In addition, all regression, unit, and TPCH-SF100 tests pass. I would recommend we allow users to turn this check off if there are performance critical queries. Removing the bounds checking at every level is going to be a fair amount of work. In the meantime, it appears that a few simple changes to variable length vectors improves query performance by about 10% across the board.
[GitHub] drill pull request #780: DRILL-5349: Fix TestParquetWriter unit tests when s...
GitHub user parthchandra opened a pull request: https://github.com/apache/drill/pull/780 DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet… … reader is used. Seems like I removed some lines from the code that I should not have. This PR reinstates them. You can merge this pull request into a Git repository by running: $ git pull https://github.com/parthchandra/drill DRILL-5349 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/780.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #780 commit 65631ddd9ba2446f6cb07921c3f1740bd43f63f9 Author: Parth Chandra Date: 2017-03-10T22:38:30Z DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet reader is used.
[GitHub] drill pull request #777: DRILL-5330: NPE in FunctionImplementationRegistry
Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/777#discussion_r105628751 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java --- @@ -160,7 +168,7 @@ public DrillFuncHolder findDrillFunction(FunctionResolver functionResolver, Func FunctionResolver exactResolver = FunctionResolverFactory.getExactResolver(functionCall); DrillFuncHolder holder = exactResolver.getBestMatch(functions, functionCall); -if (holder == null) { +if (holder == null && useDynamicUdfs) { --- End diff -- 1. Since you have mentioned it, I remembered one more issue with FunctionImplementationRegistry: it can access only system options, so using `ExecConstants.USE_DYNAMIC_UDFS` won't work properly since it can be set at session level as well. I guess using the bootstrap option you introduced is OK for now. Regarding your suggestion to have a single option OFF, READ_ONLY and ON to handle the various cases (I love this idea!), we can try to implement this in the scope of MVCC (I'll add this point to the document). 2. Even with the bootstrap option we need to update `findDrillFunction` to use the provided function resolver when dynamic udfs are turned off (more details in my first comment).
For example, `findDrillFunction` can be re-written the following way (please optimize if needed):

```java
public DrillFuncHolder findDrillFunction(FunctionResolver functionResolver, FunctionCall functionCall) {
  AtomicLong version = new AtomicLong();
  String newFunctionName = functionReplacement(functionCall);
  List<DrillFuncHolder> functions = localFunctionRegistry.getMethods(newFunctionName, version);
  if (!useDynamicUdfs) {
    return functionResolver.getBestMatch(functions, functionCall);
  }
  FunctionResolver exactResolver = FunctionResolverFactory.getExactResolver(functionCall);
  DrillFuncHolder holder = exactResolver.getBestMatch(functions, functionCall);
  if (holder == null) {
    syncWithRemoteRegistry(version.get());
    List<DrillFuncHolder> updatedFunctions = localFunctionRegistry.getMethods(newFunctionName, version);
    holder = functionResolver.getBestMatch(updatedFunctions, functionCall);
  }
  return holder;
}
```

3. Also changes should be done in the `findExactMatchingDrillFunction` method to take into account the bootstrap option as well. For example (please optimize if needed):

```java
public DrillFuncHolder findExactMatchingDrillFunction(String name, List<MajorType> argTypes, MajorType returnType) {
  if (useDynamicUdfs) {
    return findExactMatchingDrillFunction(name, argTypes, returnType, true);
  }
  return findExactMatchingDrillFunction(name, argTypes, returnType, false);
}
```
[GitHub] drill pull request #779: Indexr 0.3.0 drill 1.9.0
GitHub user xsq0718 opened a pull request: https://github.com/apache/drill/pull/779 Indexr 0.3.0 drill 1.9.0 You can merge this pull request into a Git repository by running: $ git pull https://github.com/shunfei/drill indexr-0.3.0-drill-1.9.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/779.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #779 commit ae0608b5c4b5ef9f897d6bc3a51f00f0b985bd60 Author: Sudheesh Katkam Date: 2016-11-11T23:37:35Z Revert "DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;" This reverts commit 7e7214b40784668d1599f265067f789aedb6cf86. commit 4312d65bd5e0f68dc963ed722d0cdfd2628ea5f5 Author: Sudheesh Katkam Date: 2016-11-18T19:44:30Z [maven-release-plugin] prepare release drill-1.9.0 commit ab0648e06c0c65f56f82335526940a6b40c9218a Author: flow Date: 2017-01-05T09:38:38Z Add IndexR plugin. 
IndexR project: https://github.com/shunfei/indexr commit 05df10ddaaf5d3b921b3fad73cba7a1f94689d66 Author: flow Date: 2017-01-05T12:49:19Z IndexR plugin: fix code style check error, support java 7 commit a43a62bca7121a85495abec946d37ca4a3b5516e Author: flow Date: 2017-01-06T02:45:56Z IndexR plugin: fix plugin version commit 3481e35c91ead39ea430a7cb9c58b40429a1d9d8 Author: flow Date: 2017-01-22T02:51:09Z upgrate indexr version to 0.2.0 commit 5b06a0d0b636a4b33551b8a3adb1b843d5407ae7 Author: flow Date: 2017-01-24T02:31:53Z IndexR plugin bug fix: column name should be compared ignore case commit eaec28f4be399ace65ee9ef6df1e7f5239f952bc Author: flow Date: 2017-02-08T09:21:07Z try throw column not found with segment name commit ff3c68a047833b017dc1c02043e5ede437286dfc Author: flow Date: 2017-02-16T07:57:27Z UPDATE API: using SQLType commit 483bffbe7aaee156438412ac0a18d967b587dae6 Author: flow Date: 2017-02-16T08:43:18Z fix consume time issue commit 16f60ec3687fb5969c20b51c2380d3f582684fc3 Author: flow Date: 2017-02-20T02:39:18Z update indexr version to 0.2.1 commit d81eae753e38b006ef4be497bc9a77beb871bea8 Author: flow Date: 2017-02-20T02:47:29Z update version to 0.3.0-SNAPSHOT commit e80fb99724d3c48d27b83d87d9678ee2ec71f994 Author: flow Date: 2017-02-22T03:00:04Z Update API: rsFilter#roughCheckOnRow commit 54d5d6696fa1c4fac7a27d8342cc7186cb848abf Author: flow Date: 2017-02-28T08:17:11Z use LM when contains string fields commit a49019726de0a881803941c76082d3affc9d7c39 Author: flow Date: 2017-03-06T10:36:47Z update indexr.version to 0.3.0