[jira] [Created] (DRILL-5353) Merge "Project on Project" generated in physical plan stage
Chunhui Shi created DRILL-5353: -- Summary: Merge "Project on Project" generated in physical plan stage Key: DRILL-5353 URL: https://issues.apache.org/jira/browse/DRILL-5353 Project: Apache Drill Issue Type: Bug Reporter: Chunhui Shi Assignee: Chunhui Shi There is a possibility that at the physical plan stage we will get a project-on-project plan. But the ProjectMergeRule (DrillMergeProjectRule) is only for logical planning. We need to apply the rule in the physical plan stage as well. And even after the planning stage, the JoinPrelRenameVisitor could also inject an extra Project which can be merged with the Project underneath (if there is one). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
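To illustrate why adjacent Projects can be collapsed, here is a minimal sketch in plain Java (not Calcite's RelNode/RexNode API; class and method names are mine): merging substitutes the lower project's source columns into the upper project's references, yielding a single equivalent Project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a Project-on-Project merge. Each projection is
// modeled as a map from output column name to the input column it reads
// (pass-through projections only, to keep the sketch minimal).
public class ProjectMergeSketch {

  static Map<String, String> merge(Map<String, String> upper,
                                   Map<String, String> lower) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : upper.entrySet()) {
      // The upper project reads column e.getValue() from the lower
      // project; replace it with the lower project's own source column.
      merged.put(e.getKey(), lower.get(e.getValue()));
    }
    return merged;
  }
}
```

With upper = {a -> x} and lower = {x -> col1}, the merged projection is {a -> col1}, so one of the two Project operators can be dropped.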
[GitHub] drill pull request #783: DRILL-5324: Provide simplified column reader/writer...
GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/783 DRILL-5324: Provide simplified column reader/writer for use in tests The new "sub-operator" unit test framework provides simple ways to create row sets in code. This PR includes the column accessor code: * Interfaces for column accessors * Template for generated implementations * Base implementation used by the generated code * Factory class to create the proper reader or writer given a major type (type and cardinality) * Utilities for generic access, type conversions, etc. Many vector types can be mapped to an int for get and set. One key exception is the decimal types: decimals, by definition, require a different representation. In Java, that is `BigDecimal`. Added get, set and setSafe accessors as required for each decimal type that uses `BigDecimal` to hold data. Work remains to be done on other complex types: intervals and so on. This will be added incrementally as work proceeds. The generated code builds on the `valueVectorTypes.tdd` file, adding additional properties needed to generate the accessors. The PR also includes a number of code cleanups done while reviewing existing code. In particular `DecimalUtility` was very roughly formatted and thus hard to follow.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5324 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/783.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #783 commit eb0b8bc33aeea27fd0aae582d19297bd0bda92e1 Author: Paul Rogers Date: 2017-03-11T07:03:23Z The PR includes the column accessor code: * Interfaces described above * Generated implementations * Base implementation used by the generated code * Factory class to create the proper reader or writer given a major type (type and cardinality) * Utilities for generic access, type conversions, etc. Many vector types can be mapped to an int for get and set. One key exception are the decimal types: decimals, by definition, require a different representation. In Java, that is `BigDecimal`. Added get, set and setSafe accessors as required for each decimal type that uses `BigDecimal` to hold data. Work remains to be done on other complex types: intervals and so on. This will be added incrementally as work proceeds. The generated code builds on the `valueVectorTypes.tdd` file, adding additional properties needed to generate the accessors. The PR also includes a number of code cleanups done while reviewing existing code. In particular `DecimalUtility` was very roughly formatted and thus hard to follow. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
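The accessor idea described above can be sketched as follows. This is only an illustration of the shape of the design, not the actual DRILL-5324 interfaces; all names here are assumptions, and a toy list stands in for a value vector:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: most vector types map get/set to an int, while
// decimal types need BigDecimal as their Java representation.
public class AccessorSketch {

  interface ColumnWriter {
    void setInt(int value);
    void setDecimal(BigDecimal value);
  }

  interface ColumnReader {
    int getInt(int index);
    BigDecimal getDecimal(int index);
  }

  // A toy column backed by a list; a real accessor wraps a value vector.
  static class DecimalColumn implements ColumnWriter, ColumnReader {
    private final List<BigDecimal> values = new ArrayList<>();

    @Override public void setInt(int value) { values.add(BigDecimal.valueOf(value)); }
    @Override public void setDecimal(BigDecimal value) { values.add(value); }
    @Override public int getInt(int index) { return values.get(index).intValueExact(); }
    @Override public BigDecimal getDecimal(int index) { return values.get(index); }
  }
}
```

A factory as described in the PR would pick the concrete reader/writer implementation from a column's major type (type plus cardinality), so test code only ever sees the two small interfaces.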
[GitHub] drill pull request #782: DRILL-5352: Profile parser printing for multi fragm...
GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/782 DRILL-5352: Profile parser printing for multi fragments Enhances the recently added ProfileParser to display run times for queries that contain multiple fragments. (The original version handled just a single fragment.) Prints the query in "classic" mode if it is linear, or in the new semi-indented mode if the query forms a tree. Also cleans up formatting - removing spaces between parens. You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5352 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/782.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #782 commit 6f07584ac0bf0778d1164ab0b169dfba27957a1d Author: Paul Rogers Date: 2017-03-14T03:43:25Z DRILL-5352: Profile parser printing for multi fragments Enhances the recently added ProfileParser to display run times for queries that contain multiple fragments. (The original version handled just a single fragment.) Prints the query in "classic" mode if it is linear, or in the new semi-indented mode if the query forms a tree. Also cleans up formatting - removing spaces between parens.
[jira] [Created] (DRILL-5352) Extend test framework profile parser printer for multi-fragment queries
Paul Rogers created DRILL-5352: -- Summary: Extend test framework profile parser printer for multi-fragment queries Key: DRILL-5352 URL: https://issues.apache.org/jira/browse/DRILL-5352 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.10.0 Reporter: Paul Rogers Assignee: Paul Rogers Priority: Minor Fix For: 1.11.0 The recently added test framework has a tool called the {{ProfileParser}} which started as a tool for analyzing run times of single-fragment queries. Over time, it evolved to compare planned and actual cost for multi-fragment queries. This ticket requests that multi-fragment support be added to the printing of run times. If a query is single-threaded, print the query as in the prior version: {code} Op: 0 Screen Setup: 0 - 0%, 0% Process: 35 - 0%, 0% Wait:16 Memory: 10 Op: 1 Project Setup: 22 - 1%, 0% Process: 41 - 0%, 0% Memory: 5 ... {code} If the query is multi-fragment and forms a tree, use the format used to display planning vs. actual info: {code} 03-09 . . Project Setup:0 ms - 0%, 0% Process: 0 ms - 0%, 0% 03-10 . . HashJoin (HASH JOIN) Setup:0 ms - 0%, 0% Process: 5,097,619 ms - 326770%, 73% 03-12 . . . . Project Setup: 36 ms - 2%, 0% Process:180 ms - 11%, 0% {code}
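The indented tree format shown above ("03-09 . . Project") boils down to a depth-based prefix in front of each operator. A minimal sketch of that formatting idea (hypothetical helper, not ProfileParser's actual code):

```java
// Hypothetical sketch: print an operator line in the tree format used
// above, with ". " repeated once per level of tree depth.
public class TreePrintSketch {
  static String line(String opId, String opName, int depth) {
    StringBuilder sb = new StringBuilder(opId).append(' ');
    for (int i = 0; i < depth; i++) {
      sb.append(". ");  // one dot-space pair per ancestor in the tree
    }
    return sb.append(opName).toString();
  }
}
```

For a linear (single-fragment) plan the depth is constant, which is why the "classic" flat listing suffices there.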
[GitHub] drill pull request #777: DRILL-5330: NPE in FunctionImplementationRegistry
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/777#discussion_r105792412 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java --- @@ -160,7 +168,7 @@ public DrillFuncHolder findDrillFunction(FunctionResolver functionResolver, Func FunctionResolver exactResolver = FunctionResolverFactory.getExactResolver(functionCall); DrillFuncHolder holder = exactResolver.getBestMatch(functions, functionCall); -if (holder == null) { +if (holder == null && useDynamicUdfs) { --- End diff -- Ah, now I see what's happening (I hope...) I pushed another commit that makes the suggested changes. I wonder, do we have any unit tests for the ambiguous-function case? The unit tests passed with both the original and this new version, so I wonder if we have a hole in our test coverage?
Re: Drill date & time types encoding
Thanks Parth! The date and time definitions are the “classic” ones, but conflict with the Drill documentation: http://drill.apache.org/docs/supported-data-types/ DATE Years, months, and days in YYYY-MM-DD format since 4713 BC TIME 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss Which is correct? If the documentation is wrong, we can file a JIRA to correct it. (It may not even be wrong, since one can convert from one to the other easily, it may just be misleading…) Also note that, according to C++, DATE and TIME and TIMESTAMP are exactly the same, but TIME, as a 32-bit number, could only hold about 2 years due to limited range. Also, according to SQL, DATE has no time zone, it is just a date. That is, 2016-03-13 is the same date in PST or GMT. If DATE were seconds since the UTC epoch, dates would be different in different time zones. So, I assume we use the Unix epoch, but without an implied UTC time zone as is usual for Linux and Windows timestamps? How does a TIMESTAMP differ from a DATE? Perhaps a TIMESTAMP is based on the epoch UTC while DATE has no implied time zone? Again, the documentation differs: INTERVAL (Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR.) A day-time or year-month interval TIMESTAMP JDBC timestamp in year, month, date, hour, minute, second, and optional milliseconds format: YYYY-MM-dd HH:mm:ss.SSS So, sounds like we have an INTERVALDAY and an INTERVALYEAR, but do we or do we not have an INTERVAL? If anyone knows, please let me know, else I need to do some poking around...
Thanks, - Paul On Mar 13, 2017, at 2:44 PM, Parth Chandra <par...@apache.org> wrote: Paul asked this and I'm posting here so someone who knows better can correct me if I'm wrong ( This is from my notes when I was young) DATE : Int64 : Milliseconds from Unix Epoch : 1/1/1970 00:00:00 TIME : Int32 : Milliseconds from midnight on 1/1/1970 TimeStampTZ : Int64 + Int32 : (Milliseconds from epoch + Index into list of TimeZones) TimeStamp : Int64 : Milliseconds from epoch Interval : Int32 + Int32 + Int32 : Month + Days + Milliseconds Interval Day : Int32 + Int32 : Days + Milliseconds Interval Year : Int32 : Month A slightly readable version of these can be found in the C++ client :). $drill_src/contrib/native/client/src/include/drill/recordbatch.hpp which has a bunch of 'Holder' structs for the date-time types. HTH Parth
[GitHub] drill pull request #781: DRILL-5351: Minimize bounds checking in var len vec...
GitHub user parthchandra opened a pull request: https://github.com/apache/drill/pull/781 DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader Two changes in var len vectors: 1) Instead of checking to see if we need to realloc for every setSafe call, let the write fail and catch the exception. The exception, though expensive, will happen very rarely. 2) Call fillEmpties only if there are empty values to fill. This saves a bunch of CPU on every setSafe call. You can merge this pull request into a Git repository by running: $ git pull https://github.com/parthchandra/drill DRILL-5351 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/781.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #781 commit 57869496526a43351575d0f4879d2ac28fe973d4 Author: Parth Chandra Date: 2017-02-11T01:40:25Z DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader
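Change (1) above is the classic "write optimistically, recover on overflow" pattern. A minimal sketch of the idea (assumed names, a plain byte array standing in for a DrillBuf; not the actual vector code):

```java
import java.util.Arrays;

// Sketch of optimistic writes: skip the explicit capacity check on every
// set call, rely on the JVM's own bounds check, and reallocate only on
// the rare overflow. The exception path is expensive but infrequent.
public class OptimisticWriteSketch {
  private byte[] buf = new byte[8];

  void setSafe(int index, byte value) {
    try {
      buf[index] = value;            // common case: no extra check
    } catch (ArrayIndexOutOfBoundsException e) {
      // rare case: grow the buffer and retry the write
      buf = Arrays.copyOf(buf, Math.max(buf.length * 2, index + 1));
      buf[index] = value;
    }
  }

  byte get(int index) { return buf[index]; }
}
```

The trade-off is that the fast path does no branching at all, while the slow path pays for exception construction plus a copy, which is acceptable only because overflows occur once per buffer growth rather than once per value.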
Drill date & time types encoding
Paul asked this and I'm posting here so someone who knows better can correct me if I'm wrong ( This is from my notes when I was young) DATE : Int64 : Milliseconds from Unix Epoch : 1/1/1970 00:00:00 TIME : Int32 : Milliseconds from midnight on 1/1/1970 TimeStampTZ : Int64 + Int32 : (Milliseconds from epoch + Index into list of TimeZones) TimeStamp : Int64 : Milliseconds from epoch Interval : Int32 + Int32 + Int32 : Month + Days + Milliseconds Interval Day : Int32 + Int32 : Days + Milliseconds Interval Year : Int32 : Month A slightly readable version of these can be found in the C++ client :). $drill_src/contrib/native/client/src/include/drill/recordbatch.hpp which has a bunch of 'Holder' structs for the date-time types. HTH Parth
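The two simplest encodings Parth lists above (DATE as Int64 milliseconds from the Unix epoch, TIME as Int32 milliseconds from midnight) can be sketched with `java.time`; the helper names are mine, not Drill API:

```java
import java.time.LocalDate;
import java.time.LocalTime;

// Sketch of the encodings listed above: DATE = int64 ms from the Unix
// epoch (no time zone implied), TIME = int32 ms from midnight.
public class DateTimeEncodingSketch {

  static long encodeDate(LocalDate date) {
    // Days since 1970-01-01, scaled to milliseconds.
    return date.toEpochDay() * 86_400_000L;
  }

  static int encodeTime(LocalTime time) {
    // Milliseconds since midnight; fits easily in an int32
    // (a full day is 86,400,000 ms).
    return (int) (time.toNanoOfDay() / 1_000_000L);
  }
}
```

Note that because a day is only 86,400,000 ms, an Int32 TIME comfortably covers a 24-hour clock, while DATE values are always whole-day multiples of 86,400,000 in this scheme.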
[RESULT] [VOTE] Release Apache Drill 1.10.0 rc0
The vote passes. Thanks to everyone who has tested the release candidate and given their comments and votes. Final tally: 3x +1 (binding): Aman, Parth, Jinfeng 2x +1 (non-binding): Arina, Gautam No 0s or -1s. I'll push the release artifacts and send an announcement once propagated. Thanks, Jinfeng
[jira] [Created] (DRILL-5351) Excessive bounds checking in the Parquet reader
Parth Chandra created DRILL-5351: Summary: Excessive bounds checking in the Parquet reader Key: DRILL-5351 URL: https://issues.apache.org/jira/browse/DRILL-5351 Project: Apache Drill Issue Type: Improvement Reporter: Parth Chandra In profiling the Parquet reader, the variable length decoding appears to be a major bottleneck making the reader CPU bound rather than disk bound. A YourKit profile indicates the following methods being severe bottlenecks - VarLenBinaryReader.determineSizeSerial(long) NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf) DrillBuf.chk(int, int) NullableVarBinaryVector$Mutator.fillEmpties() The problem is that each of these methods does some form of bounds checking and eventually of course, the actual write to the ByteBuf is also bounds checked. DrillBuf.chk can be disabled by a configuration setting. Disabling this does improve performance of TPCH queries. In addition, all regression, unit, and TPCH-SF100 tests pass. I would recommend we allow users to turn this check off if there are performance critical queries. Removing the bounds checking at every level is going to be a fair amount of work. In the meantime, it appears that a few simple changes to variable length vectors improves query performance by about 10% across the board.
[GitHub] drill pull request #780: DRILL-5349: Fix TestParquetWriter unit tests when s...
GitHub user parthchandra opened a pull request: https://github.com/apache/drill/pull/780 DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet… … reader is used. Seems like I removed some lines from the code that I should not have. This PR reinstates them. You can merge this pull request into a Git repository by running: $ git pull https://github.com/parthchandra/drill DRILL-5349 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/780.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #780 commit 65631ddd9ba2446f6cb07921c3f1740bd43f63f9 Author: Parth Chandra Date: 2017-03-10T22:38:30Z DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet reader is used.
[GitHub] drill pull request #777: DRILL-5330: NPE in FunctionImplementationRegistry
Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/777#discussion_r105628751 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java --- @@ -160,7 +168,7 @@ public DrillFuncHolder findDrillFunction(FunctionResolver functionResolver, Func FunctionResolver exactResolver = FunctionResolverFactory.getExactResolver(functionCall); DrillFuncHolder holder = exactResolver.getBestMatch(functions, functionCall); -if (holder == null) { +if (holder == null && useDynamicUdfs) { --- End diff -- 1. Since you have mentioned it, I remembered one more issue with FunctionImplementationRegistry: it can access only system options, so using `ExecConstants.USE_DYNAMIC_UDFS` won't work properly since it can be set at session level as well. I guess using the bootstrap option you introduced is OK for now. Regarding your suggestion to have a single option OFF, READ_ONLY and ON to handle the various cases (I love this idea!), we can try to implement this in the scope of MVCC (I'll add this point to the document). 2. Even with the bootstrap option we need to update `findDrillFunction` to use the provided function resolver when dynamic udfs are turned off (more details in my first comment).
For example, `findDrillFunction` can be re-written the following way (please optimize if needed):

```java
public DrillFuncHolder findDrillFunction(FunctionResolver functionResolver, FunctionCall functionCall) {
  AtomicLong version = new AtomicLong();
  String newFunctionName = functionReplacement(functionCall);
  List<DrillFuncHolder> functions = localFunctionRegistry.getMethods(newFunctionName, version);
  if (!useDynamicUdfs) {
    return functionResolver.getBestMatch(functions, functionCall);
  }
  FunctionResolver exactResolver = FunctionResolverFactory.getExactResolver(functionCall);
  DrillFuncHolder holder = exactResolver.getBestMatch(functions, functionCall);
  if (holder == null) {
    syncWithRemoteRegistry(version.get());
    List<DrillFuncHolder> updatedFunctions = localFunctionRegistry.getMethods(newFunctionName, version);
    holder = functionResolver.getBestMatch(updatedFunctions, functionCall);
  }
  return holder;
}
```

3. Also changes should be done in the `findExactMatchingDrillFunction` method to take into account the bootstrap option as well. For example (please optimize if needed):

```java
public DrillFuncHolder findExactMatchingDrillFunction(String name, List<MajorType> argTypes, MajorType returnType) {
  if (useDynamicUdfs) {
    return findExactMatchingDrillFunction(name, argTypes, returnType, true);
  }
  return findExactMatchingDrillFunction(name, argTypes, returnType, false);
}
```
[GitHub] drill pull request #779: Indexr 0.3.0 drill 1.9.0
GitHub user xsq0718 opened a pull request: https://github.com/apache/drill/pull/779 Indexr 0.3.0 drill 1.9.0 You can merge this pull request into a Git repository by running: $ git pull https://github.com/shunfei/drill indexr-0.3.0-drill-1.9.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/779.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #779 commit ae0608b5c4b5ef9f897d6bc3a51f00f0b985bd60 Author: Sudheesh Katkam Date: 2016-11-11T23:37:35Z Revert "DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;" This reverts commit 7e7214b40784668d1599f265067f789aedb6cf86. commit 4312d65bd5e0f68dc963ed722d0cdfd2628ea5f5 Author: Sudheesh Katkam Date: 2016-11-18T19:44:30Z [maven-release-plugin] prepare release drill-1.9.0 commit ab0648e06c0c65f56f82335526940a6b40c9218a Author: flow Date: 2017-01-05T09:38:38Z Add IndexR plugin. 
IndexR project: https://github.com/shunfei/indexr commit 05df10ddaaf5d3b921b3fad73cba7a1f94689d66 Author: flow Date: 2017-01-05T12:49:19Z IndexR plugin: fix code style check error, support java 7 commit a43a62bca7121a85495abec946d37ca4a3b5516e Author: flow Date: 2017-01-06T02:45:56Z IndexR plugin: fix plugin version commit 3481e35c91ead39ea430a7cb9c58b40429a1d9d8 Author: flow Date: 2017-01-22T02:51:09Z upgrate indexr version to 0.2.0 commit 5b06a0d0b636a4b33551b8a3adb1b843d5407ae7 Author: flow Date: 2017-01-24T02:31:53Z IndexR plugin bug fix: column name should be compared ignore case commit eaec28f4be399ace65ee9ef6df1e7f5239f952bc Author: flow Date: 2017-02-08T09:21:07Z try throw column not found with segment name commit ff3c68a047833b017dc1c02043e5ede437286dfc Author: flow Date: 2017-02-16T07:57:27Z UPDATE API: using SQLType commit 483bffbe7aaee156438412ac0a18d967b587dae6 Author: flow Date: 2017-02-16T08:43:18Z fix consume time issue commit 16f60ec3687fb5969c20b51c2380d3f582684fc3 Author: flow Date: 2017-02-20T02:39:18Z update indexr version to 0.2.1 commit d81eae753e38b006ef4be497bc9a77beb871bea8 Author: flow Date: 2017-02-20T02:47:29Z update version to 0.3.0-SNAPSHOT commit e80fb99724d3c48d27b83d87d9678ee2ec71f994 Author: flow Date: 2017-02-22T03:00:04Z Update API: rsFilter#roughCheckOnRow commit 54d5d6696fa1c4fac7a27d8342cc7186cb848abf Author: flow Date: 2017-02-28T08:17:11Z use LM when contains string fields commit a49019726de0a881803941c76082d3affc9d7c39 Author: flow Date: 2017-03-06T10:36:47Z update indexr.version to 0.3.0