[UDF] How do I return NULL

2015-10-06 Thread Tugdual Grall
Hello Drillers,

I am developing a custom function and I would like to return NULL based on
the value (for example, if the varchar is '' I want my function to return
NULL).

I have not found a way to do it.


Regards
Tug
@tgrall


[jira] [Created] (DRILL-3897) Partitions not being pruned

2015-10-06 Thread Nathaniel Auvil (JIRA)
Nathaniel Auvil created DRILL-3897:
--

 Summary: Partitions not being pruned
 Key: DRILL-3897
 URL: https://issues.apache.org/jira/browse/DRILL-3897
 Project: Apache Drill
  Issue Type: Bug
Reporter: Nathaniel Auvil


I have a two-level-deep partitioning structure. Drill is not pruning
partitions correctly; it reads all files under every directory. My source
files are tab-delimited.

My query:
select dir0 server, dir1 dayId,  max(LENGTH(columns[2])) maxSize from 
dfs.`/archive/psn` where dir1 >= 20151001 group by dir0,dir1 order by maxSize

Plan snippet showing Drill reading unneeded files:


00-00Screen : rowType = RecordType(ANY server, ANY dayId, ANY maxSize): 
rowcount = 5.177268921894E8, cumulative cost = {4.898214127009591E11 rows, 
3.373451719812133E12 cpu, 0.0 io, 9.966863946928127E13 network, 
1.51590434033232E12 memory}, id = 44973
00-01  Project(server=[$0], dayId=[$1], maxSize=[$2]) : rowType = 
RecordType(ANY server, ANY dayId, ANY maxSize): rowcount = 
5.177268921894E8, cumulative cost = {4.897696400117401E11 rows, 
3.3733999471229136E12 cpu, 0.0 io, 9.966863946928127E13 network, 
1.51590434033232E12 memory}, id = 44972
00-02SingleMergeExchange(sort0=[2 ASC]) : rowType = RecordType(ANY 
server, ANY dayId, ANY maxSize): rowcount = 5.177268921894E8, cumulative 
cost = {4.897696400117401E11 rows, 3.3733999471229136E12 cpu, 0.0 io, 
9.966863946928127E13 network, 1.51590434033232E12 memory}, id = 44971
01-01  SelectionVectorRemover : rowType = RecordType(ANY server, ANY 
dayId, ANY maxSize): rowcount = 5.177268921894E8, cumulative cost = 
{4.892519131195501E11 rows, 3.3589035941415938E12 cpu, 0.0 io, 
9.330681141805055E13 network, 1.51590434033232E12 memory}, id = 44970
01-02Sort(sort0=[$2], dir0=[ASC]) : rowType = RecordType(ANY 
server, ANY dayId, ANY maxSize): rowcount = 5.177268921894E8, cumulative 
cost = {4.887341862273601E11 rows, 3.358385867249404E12 cpu, 0.0 io, 
9.330681141805055E13 network, 1.51590434033232E12 memory}, id = 44969
01-03  Project(server=[$0], dayId=[$1], maxSize=[$2]) : rowType = 
RecordType(ANY server, ANY dayId, ANY maxSize): rowcount = 
5.177268921894E8, cumulative cost = {4.882164593351701E11 rows, 
3.2984380301424897E12 cpu, 0.0 io, 9.330681141805055E13 network, 
1.50347889491976E12 memory}, id = 44968
01-04HashToRandomExchange(dist0=[[$2]]) : rowType = 
RecordType(ANY server, ANY dayId, ANY maxSize, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): 
rowcount = 5.177268921894E8, cumulative cost = {4.882164593351701E11 rows, 
3.2984380301424897E12 cpu, 0.0 io, 9.330681141805055E13 network, 
1.50347889491976E12 memory}, id = 44967
02-01  UnorderedMuxExchange : rowType = RecordType(ANY server, 
ANY dayId, ANY maxSize, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
5.177268921894E8, cumulative cost = {4.876987324429801E11 rows, 
3.2901543998674497E12 cpu, 0.0 io, 8.48243740164096E13 network, 
1.50347889491976E12 memory}, id = 44966
03-01Project(server=[$0], dayId=[$1], maxSize=[$2], 
E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($2))]) : rowType = 
RecordType(ANY server, ANY dayId, ANY maxSize, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): 
rowcount = 5.177268921894E8, cumulative cost = {4.871810055507901E11 rows, 
3.28963667297526E12 cpu, 0.0 io, 8.48243740164096E13 network, 
1.50347889491976E12 memory}, id = 44965
03-02  HashAgg(group=[{0, 1}], maxSize=[MAX($2)]) : rowType 
= RecordType(ANY server, ANY dayId, ANY maxSize): rowcount = 
5.177268921894E8, cumulative cost = {4.866632786586001E11 rows, 
3.2875657654065E12 cpu, 0.0 io, 8.48243740164096E13 network, 
1.50347889491976E12 memory}, id = 44964
03-03Project(server=[$0], dayId=[$1], maxSize=[$2]) : 
rowType = RecordType(ANY server, ANY dayId, ANY maxSize): rowcount = 
5.1772689219E9, cumulative cost = {4.814860097367001E11 rows, 
3.1426022355933E12 cpu, 0.0 io, 8.48243740164096E13 network, 1.3667989953816E12 
memory}, id = 44963
03-04  HashToRandomExchange(dist0=[[$0]], dist1=[[$1]]) 
: rowType = RecordType(ANY server, ANY dayId, ANY maxSize, ANY 
E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.1772689219E9, cumulative cost = 
{4.814860097367001E11 rows, 3.1426022355933E12 cpu, 0.0 io, 8.48243740164096E13 
network, 1.3667989953816E12 memory}, id = 44962
04-01UnorderedMuxExchange : rowType = 
RecordType(ANY server, ANY dayId, ANY maxSize, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): 
rowcount = 5.1772689219E9, cumulative cost = {4.7630874081480005E11 rows, 
3.0804750085305E12 cpu, 0.0 io, 0.0 network, 1.3667989953816E12 memory}, id = 
44961
05-01  Project(server=[$0], dayId=[$1], 
maxSize=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1, 
hash64AsDouble($0)))]) : rowType = RecordTyp

Re: [UDF] How do I return NULL

2015-10-06 Thread Abdel Hakim Deneche
Hi Tug,

Let's say your UDF returns an int; your @Output field will be defined like
this:

@Output NullableIntHolder out;


To return a NULL you just have to set:

out.isSet = 0;


Thanks

On Tue, Oct 6, 2015 at 1:56 AM, Tugdual Grall  wrote:

> Hello Drillers,
>
> I am developing a custom function and I would like to return NULL (based on
> the value, for example if the varchar is '' I want my function to return
> NULL)
>
> I have not found the way to do it.
>
>
> Regards
> Tug
> @tgrall
>



-- 

Abdelhakim Deneche

Software Engineer


Re: [VOTE] Release Apache Drill 1.2.0 (rc0)

2015-10-06 Thread Abdel Hakim Deneche
Verified the artifact checksums and that they are signed by my GPG key.
Built Drill from source on MacOS and CentOS; both builds were successful
and all unit tests passed. Ran some window function queries and everything
seems fine.

+1 (binding)

On Mon, Oct 5, 2015 at 1:59 PM, Abdel Hakim Deneche 
wrote:

> Aman, I used the JIRA release notes generator. It includes all JIRAs marked
> "Fix for" 1.2.0. I guess we just need to move all JIRAs still open and
> marked as 1.2.0 to 1.3.0 or Future.
>
> On Mon, Oct 5, 2015 at 1:54 PM, Aman Sinha  wrote:
>
>> I see the following in the release notes: this is not supported yet. Are
>> you using the correct 'status' condition in your query?
>>
>>- [DRILL-3534 ] -
>>Insert into table support
>>
>>
>> On Mon, Oct 5, 2015 at 1:16 PM, Abdel Hakim Deneche <
>> adene...@maprtech.com>
>> wrote:
>>
>> > One clarification: the commit that should show up in the release is the
>> > following:
>> >
>> > b418397790e7e00505846d48bc6458d710c00095
>> > upgrading maven-release plugin to fix release issues
>> >
>> > master has already moved past that commit
>> >
>> > thanks
>> >
>> > On Mon, Oct 5, 2015 at 11:00 AM, Abdel Hakim Deneche <
>> > adene...@maprtech.com>
>> > wrote:
>> >
>> > > Hey all,
>> > >
>> > > I'm happy to propose a new release of Apache Drill, version 1.2.0.
>> This
>> > is
>> > > the first release candidate (rc0).
>> > >
>> > > Thanks to everyone who contributed to this release, we have more than
>> 200
>> > > closed and resolved JIRAs
>> > > [1].
>> > >
>> > > The tarball artifacts are hosted at [2] and the maven artifacts (new
>> for
>> > > this release) are hosted at [3].
>> > >
>> > > The vote will be open for the next 72 hours ending at 11AM Pacific,
>> > > October 8, 2015.
>> > >
>> > > [ ] +1
>> > > [ ] +0
>> > > [ ] -1
>> > >
>> > > thanks,
>> > > Hakim
>> > >
>> > > [1]
>> > >
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332042&projectId=12313820
>> > > [2] http://people.apache.org/~adeneche/apache-drill-1.2.0-rc0/
>> > > [3]
>> > https://repository.apache.org/content/repositories/orgapachedrill-1004



-- 

Abdelhakim Deneche

Software Engineer


Re: [VOTE] Release Apache Drill 1.2.0 (rc0)

2015-10-06 Thread Edmon Begoli
Humbly, +1.

On Tue, Oct 6, 2015 at 12:32 PM, Abdel Hakim Deneche 
wrote:

> verified the artifacts checksums and that they are signed by my gpg key.
> Built Drill from source in MacOS and CentOS and both builds were successful
> and all unit tests passed. Run some window functions queries and everything
> seems fine.
>
> +1 (binding)


Re: Drill Hangout starting now

2015-10-06 Thread Parth Chandra
Join us here:
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


[jira] [Created] (DRILL-3898) NPE during external sort when there is not enough space for spilling

2015-10-06 Thread Victoria Markman (JIRA)
Victoria Markman created DRILL-3898:
---

 Summary: NPE during external sort when there is not enough space 
for spilling
 Key: DRILL-3898
 URL: https://issues.apache.org/jira/browse/DRILL-3898
 Project: Apache Drill
  Issue Type: Bug
Reporter: Victoria Markman


While verifying DRILL-3732 I ran into a new problem.
I think Drill somehow loses track of the out-of-disk exception and does not
cancel the rest of the query, which results in an NPE:

Reproduction is the same as in DRILL-3732:
{code}
0: jdbc:drill:schema=dfs> create table store_sales_20(ss_item_sk, 
ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, s_sold_date_sk, ss_promo_sk) 
partition by (ss_promo_sk) as
. . . . . . . . . . . . >  select 
. . . . . . . . . . . . >  case when columns[2] = '' then cast(null as 
varchar(100)) else cast(columns[2] as varchar(100)) end,
. . . . . . . . . . . . >  case when columns[3] = '' then cast(null as 
varchar(100)) else cast(columns[3] as varchar(100)) end,
. . . . . . . . . . . . >  case when columns[4] = '' then cast(null as 
varchar(100)) else cast(columns[4] as varchar(100)) end, 
. . . . . . . . . . . . >  case when columns[5] = '' then cast(null as 
varchar(100)) else cast(columns[5] as varchar(100)) end, 
. . . . . . . . . . . . >  case when columns[0] = '' then cast(null as 
varchar(100)) else cast(columns[0] as varchar(100)) end, 
. . . . . . . . . . . . >  case when columns[8] = '' then cast(null as 
varchar(100)) else cast(columns[8] as varchar(100)) end
. . . . . . . . . . . . >  from 
. . . . . . . . . . . . >   `store_sales.dat` ss 
. . . . . . . . . . . . > ;
Error: SYSTEM ERROR: NullPointerException
Fragment 1:16
[Error Id: 0ae9338d-d04f-4b4a-93aa-a80d13cedb29 on atsqa4-133.qa.lab:31010] 
(state=,code=0)
{code}

This exception in drillbit.log should have triggered query cancellation:
{code}
2015-10-06 17:01:34,463 [WorkManager-2] ERROR 
o.apache.drill.exec.work.WorkManager - 
org.apache.drill.exec.work.WorkManager$WorkerBee$1.run() leaked an exception.
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
at 
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:226)
 ~[hadoop-common-2.5.1-mapr-1503.jar:na]
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) 
~[na:1.7.0_71]
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
~[na:1.7.0_71]
at java.io.FilterOutputStream.close(FilterOutputStream.java:157) 
~[na:1.7.0_71]
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
 ~[hadoop-common-2.5.1-mapr-1503.jar:na]
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) 
~[hadoop-common-2.5.1-mapr-1503.jar:na]
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:400)
 ~[hadoop-common-2.5.1-mapr-1503.jar:na]
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
 ~[hadoop-common-2.5.1-mapr-1503.jar:na]
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) 
~[hadoop-common-2.5.1-mapr-1503.jar:na]
at 
org.apache.drill.exec.physical.impl.xsort.BatchGroup.close(BatchGroup.java:152) 
~[drill-java-exec-1.2.0.jar:1.2.0]
at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:44) 
~[drill-common-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:553)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext(ExternalSortBatch.java:362)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:94)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:94)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:94)
 ~[drill-java-exec-1.2.0.jar:1.2.0]
at 
org.apache.drill.exec.record.Abs

[jira] [Created] (DRILL-3899) SplitUpComplexExpressions rule should be enhanced to avoid planning unnecessary copies of data

2015-10-06 Thread Jason Altekruse (JIRA)
Jason Altekruse created DRILL-3899:
--

 Summary: SplitUpComplexExpressions rule should be enhanced to 
avoid planning unnecessary copies of data
 Key: DRILL-3899
 URL: https://issues.apache.org/jira/browse/DRILL-3899
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jason Altekruse


A small enhancement was made as part of DRILL-3876 to remove an unnecessary 
copy in a simple flatten case. This was easy to implement, but did not cover 
all of the possible cases where the rewrite rule is currently planning 
inefficient operations. This issue is tracking the more complete fix to handle 
all of the more complex cases optimally.





[jira] [Created] (DRILL-3900) OOM with Hive native scan enabled on TPCH-100 parquet, query 05.q

2015-10-06 Thread Chun Chang (JIRA)
Chun Chang created DRILL-3900:
-

 Summary: OOM with Hive native scan enabled on TPCH-100 parquet, 
query 05.q
 Key: DRILL-3900
 URL: https://issues.apache.org/jira/browse/DRILL-3900
 Project: Apache Drill
  Issue Type: Bug
  Components: Functions - Hive
Affects Versions: 1.2.0
Reporter: Chun Chang


TPCH-100 parquet dataset. Configure Hive 1.0 pointing to the parquet files as 
external tables. Enable Hive native scan.

{noformat}
alter system set `store.hive.optimize_scan_with_native_readers`=true;
{noformat}

Running TPCH query 05 through Hive, the drillbit runs out of memory. The same
query run through dfs completes successfully. (With the Hive native scan
disabled, Drill also runs out of memory through Hive.)

We expect the query to finish with the Hive native scan turned on.





[GitHub] drill pull request: DRILL-3876: Avoid an extra copy of the origina...

2015-10-06 Thread jaltekruse
GitHub user jaltekruse opened a pull request:

https://github.com/apache/drill/pull/187

DRILL-3876: Avoid an extra copy of the original list when flattening

This only fixes a basic case; a more complete refactoring of the rewrite 
rule could avoid copies in cases with multiple flattens. That will be addressed 
in DRILL-3899.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jaltekruse/incubator-drill 
3876-fix-flatten-simple-extra-copy

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/187.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #187


commit da864a6b0307712166fc02842cb2aa8ce46e36b5
Author: Jason Altekruse 
Date:   2015-10-06T16:23:18Z

DRILL-3876: Avoid an extra copy of the original list when flattening

This only fixes a basic case, a more complete refactoring of the rewrite 
rule could avoid copies in cases with multiple flattens, this will be addressed 
in DRILL-3899.






Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread Maryann Xue
The partial aggregate seems to be working now, with one interface extension
and one bug fix in the Phoenix project. Will do some code cleanup and
create a pull request soon.

There is still a hack I made in the Drill project to force 2-phase
aggregation. I'll try to fix that.

Jacques, I have one question though: how can I verify that there is more
than one slice and that the shuffle happens?


Thanks,
Maryann

On Mon, Oct 5, 2015 at 2:03 PM, James Taylor  wrote:

> Maryann,
> I believe Jacques mentioned that a little bit of refactoring is required
> for a merge sort to occur - there's something that does that, but it's not
> expected to be used in this context currently.
>
> IMHO, there's clearer value in getting the aggregation to use
> Phoenix first, so I'd recommend going down that road as Jacques mentioned
> above if possible. Once that's working, we can circle back to the partial
> sort.
>
> Thoughts?
> James
>
> On Mon, Oct 5, 2015 at 10:40 AM, Maryann Xue 
> wrote:
>
>> I actually tried implementing partial sort with
>> https://github.com/jacques-n/drill/pull/4, which I figured might be a
>> little easier to start with than partial aggregation. But I found that even
>> though the code worked (returned the right results), the Drill side sort
>> turned out to be an ordinary sort instead of a merge which it should have
>> been. Any idea of how to fix that?
>>
>>
>> Thanks,
>> Maryann
>>
>> On Mon, Oct 5, 2015 at 12:52 PM, Jacques Nadeau 
>> wrote:
>>
>>> Right now this type of work is done here:
>>>
>>>
>>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/HashAggPrule.java
>>>
>>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/AggPruleBase.java
>>>
>>> With Distribution Trait application here:
>>>
>>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/DrillDistributionTraitDef.java
>>>
>>> To me, the easiest way to solve the Phoenix issue is by providing a rule
>>> that matches HashAgg and StreamAgg but requires Phoenix convention as
>>> input. It would replace everywhere but would only be plannable when it is
>>> the first phase of aggregation.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>> On Thu, Oct 1, 2015 at 2:30 PM, Julian Hyde  wrote:
>>>
 Phoenix is able to perform quite a few relational operations on the
 region server: scan, filter, project, aggregate, sort (optionally with
 limit). However, the sort and aggregate are necessarily "local". They
 can only deal with data on that region server, and there needs to be a
 further operation to combine the results from the region servers.

 The question is how to plan such queries. I think the answer is an
 AggregateExchangeTransposeRule.

 The rule would spot an Aggregate on a data source that is split into
 multiple locations (partitions) and split it into a partial Aggregate
 that computes sub-totals and a summarizing Aggregate that combines
 those totals.

 How does the planner know that the Aggregate needs to be split? Since
 the data's distribution has changed, there would need to be an
 Exchange operator. It is the Exchange operator that triggers the rule
 to fire.

 There are some special cases. If the data is sorted as well as
 partitioned (say because the local aggregate uses a sort-based
 algorithm) we could maybe use a more efficient plan. And if the
 partition key is the same as the aggregation key we don't need a
 summarizing Aggregate, just a Union.

 It turns out not to be very Phoenix-specific. In the Drill-on-Phoenix
 scenario, once the Aggregate has been pushed through the Exchange
 (i.e. onto the drill-bit residing on the region server) we can then
 push the DrillAggregate across the drill-to-phoenix membrane and make
 it into a PhoenixServerAggregate that executes in the region server.

 Related issues:
 * https://issues.apache.org/jira/browse/DRILL-3840
 * https://issues.apache.org/jira/browse/CALCITE-751

 Julian

>>>
>>>
>>
>

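To make the two-phase split concrete: for a query like select a, count(*)
from t group by a, the rule Julian describes turns the single Aggregate into
a partial phase below the Exchange and a summarizing phase above it. A sketch
of the plan shapes (illustrative only, not actual Drill EXPLAIN output):

    Before the rule fires:             After the rule fires:

    Aggregate(group=[a], COUNT(*))     Aggregate(group=[a], SUM(cnt))          <- summarizing
      Exchange(dist=[a])                 Exchange(dist=[a])                    <- shuffle on group key
        Scan(t) [n partitions]             Aggregate(group=[a], COUNT(*) cnt)  <- partial, per partition
                                             Scan(t) [n partitions]

Note that the summarizing phase computes the SUM of the partial counts, since
a global COUNT(*) is the sum of the per-partition counts; this is also why
only certain aggregate functions qualify for a 2-phase plan.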

[jira] [Created] (DRILL-3901) Performance regression with doing Explain of COUNT(*) over 100K files

2015-10-06 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-3901:
-

 Summary: Performance regression with doing Explain of COUNT(*) 
over 100K files
 Key: DRILL-3901
 URL: https://issues.apache.org/jira/browse/DRILL-3901
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Aman Sinha


We are seeing a performance regression when doing an Explain of SELECT COUNT(*) 
over 100K files in a flat directory (no subdirectories) on the latest master 
branch compared to a run that was done on Sept 26. Some initial details (I will have 
more later): 

{code}
master branch on Sept 26
   No metadata cache: 71.452 secs
   With metadata cache: 15.804 secs

Latest master branch 
   No metadata cache: 110 secs
   With metadata cache: 32 secs
{code}

So, both cases show regression.  

[~mehant] and I took an initial look at this and it appears we might be doing 
the directory expansion twice.  
   





Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread Julian Hyde
Drill's current approach seems adequate for Drill alone but extending
it to a heterogeneous system that includes Phoenix seems like a hack.

I think you should only create Prels for algebra nodes that you know
for sure are going to run on the Drill engine. If there's a
possibility that it would run in another engine such as Phoenix then
they should still be logical.

On Tue, Oct 6, 2015 at 11:03 AM, Maryann Xue  wrote:
> The partial aggregate seems to be working now, with one interface extension
> and one bug fix in the Phoenix project. Will do some code cleanup and
> create a pull request soon.
>
> Still there was a hack in the Drill project which I made to force 2-phase
> aggregation. I'll try to fix that.
>
> Jacques, I have one question though, how can I verify that there are more
> than one slice and the shuffle happens?
>
>
> Thanks,
> Maryann


Re: [VOTE] Release Apache Drill 1.2.0 (rc0)

2015-10-06 Thread Aman Sinha
I have filed DRILL-3901 for a performance issue that we are trying to
address.  We can discuss whether to continue with the existing release
candidate or wait for a fix.

On Tue, Oct 6, 2015 at 9:38 AM, Edmon Begoli  wrote:

> Humbly, +1.
>
> On Tue, Oct 6, 2015 at 12:32 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > verified the artifacts checksums and that they are signed by my gpg key.
> > Built Drill from source in MacOS and CentOS and both builds were
> successful
> > and all unit tests passed. Run some window functions queries and
> everything
> > seems fine.
> >
> > +1 (binding)


Re: [UDF] How do I return NULL

2015-10-06 Thread Steven Phillips
In addition, your UDF needs to have the attribute "nulls =
NullHandling.INTERNAL"

On Tue, Oct 6, 2015 at 8:32 AM, Abdel Hakim Deneche 
wrote:

> Hi Tug,
>
> Let's say your UDF returns an int, your @output field will be defined like
> this:
>
> @Output NullableIntHolder out;
>
>
> To return a NULL you just have to set:
>
> out.isSet = 0;
>
>
> Thanks
>
> On Tue, Oct 6, 2015 at 1:56 AM, Tugdual Grall  wrote:
>
> > Hello Drillers,
> >
> > I am developing a custom function and I would like to return NULL (based
> on
> > the value, for example if the varchar is '' I want my function to return
> > NULL)
> >
> > I have not found the way to do it.
> >
> >
> > Regards
> > Tug
> > @tgrall
> >

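Putting Hakim's and Steven's answers together, a complete function would look
roughly like this. This is a minimal sketch against the Drill 1.x UDF API; the
function name, class name, and the returned value are illustrative, not from
the thread:

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.NullableIntHolder;
import org.apache.drill.exec.expr.holders.NullableVarCharHolder;

@FunctionTemplate(
    name = "empty_to_null_length",                   // illustrative name
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)  // UDF handles nulls itself
public class EmptyToNullLength implements DrillSimpleFunc {

  @Param  NullableVarCharHolder in;
  @Output NullableIntHolder out;

  public void setup() { }

  public void eval() {
    if (in.isSet == 0 || in.end == in.start) {  // null or empty varchar input
      out.isSet = 0;                            // result is SQL NULL
    } else {
      out.isSet = 1;                            // result is present
      out.value = in.end - in.start;            // e.g. the value's byte length
    }
  }
}

With NullHandling.INTERNAL the engine passes null inputs through to eval()
instead of short-circuiting them, which is why the function itself must check
in.isSet and set out.isSet.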

Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread Maryann Xue
Added a few fixes in the pull request. Tested with two regions; it turned out
that half of the result rows are empty (count = 0).
Not sure if there's anything wrong with
https://github.com/maryannxue/drill/blob/phoenix_plugin/contrib/storage-phoenix/src/main/java/org/apache/drill/exec/store/phoenix/rel/PhoenixHashAggPrule.java
.
Like Julian said, this rule looks a bit hacky.

To force a 2-phase HashAgg, I made a temporary change as well:

diff --git a/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/AggPruleBase.java b/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/AggPruleBase.java
index b911f6b..58bc918 100644
--- a/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/AggPruleBase.java
+++ b/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/AggPruleBase.java
@@ -60,12 +60,12 @@ public abstract class AggPruleBase extends Prule {
   // If any of the aggregate functions are not one of these, then we
   // currently won't generate a 2 phase plan.
   protected boolean create2PhasePlan(RelOptRuleCall call, DrillAggregateRel aggregate) {
-    PlannerSettings settings = PrelUtil.getPlannerSettings(call.getPlanner());
-    RelNode child = call.rel(0).getInputs().get(0);
-    boolean smallInput = child.getRows() < settings.getSliceTarget();
-    if (! settings.isMultiPhaseAggEnabled() || settings.isSingleMode() || smallInput) {
-      return false;
-    }
+    //PlannerSettings settings = PrelUtil.getPlannerSettings(call.getPlanner());
+    //RelNode child = call.rel(0).getInputs().get(0);
+    //boolean smallInput = child.getRows() < settings.getSliceTarget();
+    //if (! settings.isMultiPhaseAggEnabled() || settings.isSingleMode() || smallInput) {
+    //  return false;
+    //}
 
     for (AggregateCall aggCall : aggregate.getAggCallList()) {
       String name = aggCall.getAggregation().getName();


Thanks,
Maryann



On Tue, Oct 6, 2015 at 2:31 PM, Julian Hyde  wrote:

> Drill's current approach seems adequate for Drill alone but extending
> it to a heterogenous system that includes Phoenix seems like a hack.
>
> I think you should only create Prels for algebra nodes that you know
> for sure are going to run on the Drill engine. If there's a
> possibility that it would run in another engine such as Phoenix then
> they should still be logical.

[jira] [Created] (DRILL-3902) Bad error message: core cause not included in text; maybe wrong kind

2015-10-06 Thread Daniel Barclay (Drill) (JIRA)
Daniel Barclay (Drill) created DRILL-3902:
-

 Summary: Bad error message:  core cause not included in text; 
maybe wrong kind
 Key: DRILL-3902
 URL: https://issues.apache.org/jira/browse/DRILL-3902
 Project: Apache Drill
  Issue Type: Bug
Reporter: Daniel Barclay (Drill)


When trying to use an empty directory as a table causes Drill to fail by 
hitting an IndexOutOfBoundsException, the final error message includes the text 
from the exception's getMessage(), but it fails to mention 
IndexOutOfBoundsException itself (or equivalent information):

{noformat}
0: jdbc:drill:zk=localhost:2181> SELECT *   FROM 
`dfs`.`root`.`/tmp/empty_directory`;
Error: VALIDATION ERROR: Index: 0, Size: 0


[Error Id: 66ff61ed-ea41-4af9-87c5-f91480ef1b21 on dev-linux2:31010] 
(state=,code=0)
0: jdbc:drill:zk=localhost:2181> 
{noformat}

Also, since this isn't a coherent/intentional validation error but an internal 
error, shouldn't this be a SYSTEM ERROR message?

(Does the SYSTEM ERROR case include the exception class name in the message?)

Daniel









[jira] [Created] (DRILL-3903) Querying empty directory yields internal index-out-of-bounds error

2015-10-06 Thread Daniel Barclay (Drill) (JIRA)
Daniel Barclay (Drill) created DRILL-3903:
-

 Summary: Querying empty directory yields internal 
index-out-of-bounds error
 Key: DRILL-3903
 URL: https://issues.apache.org/jira/browse/DRILL-3903
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Other
Reporter: Daniel Barclay (Drill)
Assignee: Jacques Nadeau


Trying to use an empty directory as a table results in an internal 
IndexOutOfBounds error:

{noformat}
0: jdbc:drill:zk=localhost:2181> SELECT *   FROM 
`dfs`.`root`.`/tmp/empty_directory`;
Error: VALIDATION ERROR: Index: 0, Size: 0


[Error Id: 66ff61ed-ea41-4af9-87c5-f91480ef1b21 on dev-linux2:31010] 
(state=,code=0)
0: jdbc:drill:zk=localhost:2181> 
{noformat}








Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread James Taylor
Nice progress, Maryann.

A few questions for you: not sure I understand the changes you made to
PhoenixRecordReader. Is it necessary to wrap the server-side scan results
in a GroupedAggregatingResultIterator? Each server-side scan will produce
results with a single tuple per group by key. In Phoenix, the
GroupedAggregatingResultIterator's function in life is to do the final
merge. Note too that the results that come back from the aggregated scan
aren't sorted (while GroupedAggregatingResultIterator needs tuples sorted
by the group-by key). Or is this just to help in decoding the values coming
back from the scan?

Also, not sure what impact it has in the way we "combine" the scans in our
Drill parallelization code (PhoenixGroupScan.applyAssignments()), as each
of our scans could include duplicate group by keys. Is it ok to combine
them in this case?

One more question: how is the group by key communicated back to Drill?

Thanks,
James


On Tue, Oct 6, 2015 at 2:10 PM, Maryann Xue  wrote:

> Added a few fixes in the pull request. Tested with two regions, turned out
> that half of the result is empty (count = 0).
> Not sure if there's anything wrong with
> https://github.com/maryannxue/drill/blob/phoenix_plugin/contrib/storage-phoenix/src/main/java/org/apache/drill/exec/store/phoenix/rel/PhoenixHashAggPrule.java
> .
> Like Julian said, this rule looks a bit hacky.

[jira] [Created] (DRILL-3904) Document support for multiple window functions in query

2015-10-06 Thread Bridget Bevens (JIRA)
Bridget Bevens created DRILL-3904:
-

 Summary: Document support for multiple window functions in query
 Key: DRILL-3904
 URL: https://issues.apache.org/jira/browse/DRILL-3904
 Project: Apache Drill
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Bridget Bevens
Assignee: Bridget Bevens


Edit second paragraph to state that query can include multiple window functions:

When you use a window function in a query, define the window using the OVER() 
clause. The OVER() clause (window definition) differentiates window functions 
from other analytical and reporting functions. A query can include multiple 
window functions with the same or different window definitions.
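
For example, a query of this shape uses two window functions with different
window definitions (table and column names are made up for illustration):

{code}
SELECT dept,
       salary,
       AVG(salary)  OVER (PARTITION BY dept) AS dept_avg,
       ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM emp;
{code}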





[jira] [Resolved] (DRILL-3904) Document support for multiple window functions in query

2015-10-06 Thread Bridget Bevens (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bridget Bevens resolved DRILL-3904.
---
Resolution: Fixed

Updated the SQL window function intro on the Drill website to include the info.

> Document support for multiple window functions in query
> ---
>
> Key: DRILL-3904
> URL: https://issues.apache.org/jira/browse/DRILL-3904
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Bridget Bevens
>Assignee: Bridget Bevens
>
> Edit second paragraph to state that query can include multiple window 
> functions:
> When you use a window function in a query, define the window using the OVER() 
> clause. The OVER() clause (window definition) differentiates window functions 
> from other analytical and reporting functions. A query can include multiple 
> window functions with the same or different window definitions.





[jira] [Created] (DRILL-3905) Document DROP TABLE support

2015-10-06 Thread Bridget Bevens (JIRA)
Bridget Bevens created DRILL-3905:
-

 Summary: Document DROP TABLE support 
 Key: DRILL-3905
 URL: https://issues.apache.org/jira/browse/DRILL-3905
 Project: Apache Drill
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Bridget Bevens
Assignee: Bridget Bevens


Add documentation to Drill docs for DROP TABLE 

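For reference, the statement being documented has this basic shape (the
workspace and table name here are illustrative, not from the JIRA):

{code}
DROP TABLE dfs.tmp.`old_donuts`;
{code}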




[jira] [Resolved] (DRILL-3905) Document DROP TABLE support

2015-10-06 Thread Bridget Bevens (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bridget Bevens resolved DRILL-3905.
---
Resolution: Fixed

Doc written, reviewed, added to Drill website

> Document DROP TABLE support 
> 
>
> Key: DRILL-3905
> URL: https://issues.apache.org/jira/browse/DRILL-3905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Bridget Bevens
>Assignee: Bridget Bevens
>
> Add documentation to Drill docs for DROP TABLE 





[jira] [Created] (DRILL-3906) Add documentation for LEAD, LAG, FIRST_VALUE, LAST_VALUE and NTILE

2015-10-06 Thread Bridget Bevens (JIRA)
Bridget Bevens created DRILL-3906:
-

 Summary: Add documentation for LEAD, LAG, FIRST_VALUE, LAST_VALUE 
and NTILE
 Key: DRILL-3906
 URL: https://issues.apache.org/jira/browse/DRILL-3906
 Project: Apache Drill
  Issue Type: New Feature
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Bridget Bevens


Create docs for new window functions
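
By way of illustration, the new functions look like this in use (a sketch with
made-up table and column names):

{code}
SELECT sale_date,
       sales,
       LAG(sales)  OVER (ORDER BY sale_date) AS prev_sales,
       LEAD(sales) OVER (ORDER BY sale_date) AS next_sales,
       FIRST_VALUE(sales) OVER (PARTITION BY dept ORDER BY sale_date) AS first_in_dept,
       NTILE(4) OVER (ORDER BY sales) AS quartile
FROM store_sales;
{code}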





[jira] [Resolved] (DRILL-3906) Add documentation for LEAD, LAG, FIRST_VALUE, LAST_VALUE and NTILE

2015-10-06 Thread Bridget Bevens (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bridget Bevens resolved DRILL-3906.
---
Resolution: Fixed

Docs created, edited, pushed to apache drill site.

> Add documentation for LEAD, LAG, FIRST_VALUE, LAST_VALUE and NTILE
> --
>
> Key: DRILL-3906
> URL: https://issues.apache.org/jira/browse/DRILL-3906
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Bridget Bevens
>
> Create docs for new window functions





Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread Jacques Nadeau
I'm not sure how to accomplish this cleanly. The concept of two-phased
agg-key distributed aggregation (and exchanges in general) seems very much
a physical concept. Since Phoenix can only do half this operation (in
parallel), I'm having trouble figuring out what the logical plan would look
like if we did this transformation in the logical planning phase. (In
general, I think part of the problem is that each phase of planning can
output only a single plan.) Since we have chosen to break the planning into
~5 phases due to performance, we have to pick where the transformations are
most appropriate.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Oct 6, 2015 at 11:31 AM, Julian Hyde  wrote:

> Drill's current approach seems adequate for Drill alone but extending
> it to a heterogenous system that includes Phoenix seems like a hack.
>
> I think you should only create Prels for algebra nodes that you know
> for sure are going to run on the Drill engine. If there's a
> possibility that it would run in another engine such as Phoenix then
> they should still be logical.
>
> On Tue, Oct 6, 2015 at 11:03 AM, Maryann Xue 
> wrote:
> > The partial aggregate seems to be working now, with one interface
> extension
> > and one bug fix in the Phoenix project. Will do some code cleanup and
> > create a pull request soon.
> >
> > Still there was a hack in the Drill project which I made to force 2-phase
> > aggregation. I'll try to fix that.
> >
> > Jacques, I have one question though, how can I verify that there are more
> > than one slice and the shuffle happens?
> >
> >
> > Thanks,
> > Maryann
> >
> > On Mon, Oct 5, 2015 at 2:03 PM, James Taylor 
> wrote:
> >
> >> Maryann,
> >> I believe Jacques mentioned that a little bit of refactoring is required
> >> for a merge sort to occur - there's something that does that, but it's
> not
> >> expected to be used in this context currently.
> >>
> >> IMHO, there's more of a clear value in getting the aggregation to use
> >> Phoenix first, so I'd recommend going down that road as Jacques
> mentioned
> >> above if possible. Once that's working, we can circle back to the
> partial
> >> sort.
> >>
> >> Thoughts?
> >> James
> >>
> >> On Mon, Oct 5, 2015 at 10:40 AM, Maryann Xue 
> >> wrote:
> >>
> >>> I actually tried implementing partial sort with
> >>> https://github.com/jacques-n/drill/pull/4, which I figured might be a
> >>> little easier to start with than partial aggregation. But I found that
> even
> >>> though the code worked (returned the right results), the Drill side
> sort
> >>> turned out to be a ordinary sort instead of a merge which it should
> have
> >>> been. Any idea of how to fix that?
> >>>
> >>>
> >>> Thanks,
> >>> Maryann
> >>>
> >>> On Mon, Oct 5, 2015 at 12:52 PM, Jacques Nadeau 
> >>> wrote:
> >>>
>  Right now this type of work is done here:
> 
> 
> 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/HashAggPrule.java
> 
> 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/AggPruleBase.java
> 
>  With Distribution Trait application here:
> 
> 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/DrillDistributionTraitDef.java
> 
>  To me, the easiest way to solve the Phoenix issue is by providing a
> rule
>  that matches HashAgg and StreamAgg but requires Phoenix convention as
>  input. It would replace everywhere but would only be plannable when
> it is
>  the first phase of aggregation.
> 
>  Thoughts?
> 
> 
> 
>  --
>  Jacques Nadeau
>  CTO and Co-Founder, Dremio
> 
>  On Thu, Oct 1, 2015 at 2:30 PM, Julian Hyde  wrote:
> 
> > Phoenix is able to perform quite a few relational operations on the
> > region server: scan, filter, project, aggregate, sort (optionally
> with
> > limit). However, the sort and aggregate are necessarily "local". They
> > can only deal with data on that region server, and there needs to be
> a
> > further operation to combine the results from the region servers.
> >
> > The question is how to plan such queries. I think the answer is an
> > AggregateExchangeTransposeRule.
> >
> > The rule would spot an Aggregate on a data source that is split into
> > multiple locations (partitions) and split it into a partial Aggregate
> > that computes sub-totals and a summarizing Aggregate that combines
> > those totals.
> >
> > How does the planner know that the Aggregate needs to be split? Since
> > the data's distribution has changed, there would need to be an
> > Exchange operator. It is the Exchange operator that triggers the rule
> > to fire.
> >
> > There are some special cases. If the data is sorted as well as
> > partitioned (say because the local aggregate uses a s
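
Sketched concretely, the rule Jacques describes above might look roughly like
the following. This is a sketch only: PhoenixRel.CONVENTION and
PhoenixAggregate are illustrative names for an assumed Phoenix calling
convention and physical aggregate, not the actual patch. The rule matches a
logical aggregate and demands its input in the Phoenix convention, so the
alternative is only plannable when the child can be pushed to Phoenix, i.e.
for the first phase of aggregation:

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.RelNode;
import org.apache.drill.exec.planner.logical.DrillAggregateRel;

public class PhoenixAggPrule extends RelOptRule {
  public static final PhoenixAggPrule INSTANCE = new PhoenixAggPrule();

  private PhoenixAggPrule() {
    super(operand(DrillAggregateRel.class, operand(RelNode.class, any())),
        "PhoenixAggPrule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final DrillAggregateRel agg = call.rel(0);
    final RelNode input = call.rel(1);
    // Ask for the child in the (assumed) Phoenix convention; the planner
    // simply discards this alternative if the child cannot be converted.
    final RelNode convertedInput =
        convert(input, input.getTraitSet().replace(PhoenixRel.CONVENTION));
    // PhoenixAggregate is a hypothetical physical aggregate that runs on the
    // region servers and produces partial results, one tuple per group key
    // per scan.
    call.transformTo(new PhoenixAggregate(agg.getCluster(),
        agg.getTraitSet().replace(PhoenixRel.CONVENTION), convertedInput,
        agg.getGroupSet(), agg.getAggCallList()));
  }
}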

Drill Hangout minutes - 2015-10-06 Re: Drill Hangout starting now

2015-10-06 Thread Parth Chandra
Drill Hangout 2015-10-06

Attendees: Aman, Andries, Daniel, Kris, Charlie, Julien, Jacques, Jason,
Jinfeng, Matt, Parth, Sudheesh, Venki


1. Matt hitting issues with Information Schema queries against Hive. Will
   connect with Venki on Slack to resolve.

2. Julien reported that he's working on speeding up building and running
   tests, noting that build-time code generation runs twice and that local
   Drillbits for testing take 3 seconds to shut down.

3. Parth mentioned an off-by-one bug in Parquet reading and said he will add
   more Parquet reading tests as part of the fix.

4. Aman reported a performance regression while trying metadata caching with
   400K files. This is being investigated.

5. Daniel, Jacques, and Sudheesh discussed issues underlying DRILL-2288, such
   as the ScanBatch.next() return value (IterOutcome) contract, handling
   empty JSON files, handling zero-row sources that still have schemas, how
   to limit the DRILL-2288 fix to avoid needing to rework lots of downstream
   code, etc.

6. Sudheesh had various updates on Limit 0 and Limit 1 queries. Jacques
   suggested moving the handling of Limit 0 queries on schema-aware systems
   to the planning phase. Perf tests on the RPC processing offload seem to
   show higher memory consumption, though this may simply be due to allowing
   more concurrent queries as a result of the patch. Perf tests also reveal
   issues with the local data tunnel changes, but these may be existing
   problems that are only now showing up as a result of faster local data
   processing. Question to be resolved: should we merge these anyway?

7. Jason is helping address some recent issues with flatten involving a
   large number of repeated values.

8. We unanimously volunteered Sudheesh to work on the performance cluster.




On Tue, Oct 6, 2015 at 10:06 AM, Parth Chandra  wrote:

>
>
> Join us here:
>> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>>
>
>


Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread Maryann Xue
Hi James,

bq. A few questions for you: not sure I understand the changes you made to
PhoenixRecordReader. Is it necessary to wrap the server-side scan results
in a GroupedAggregatingResultIterator? Each server-side scan will produce
results with a single tuple per group by key. In Phoenix, the
GroupedAggregatingResultIterator's function in life is to do the final
merge. Note too that the results aren't sorted that come back from the
aggregated scan (while GroupedAggregatingResultIterator needs tuples sorted
by the group by key). Or is this just to help in decoding the values coming
back from the scan?

It is necessary. I suppose what we should return as a partial result from
PhoenixRecordReader is exactly the same as what we do in standalone
Phoenix+Calcite, except that the result is partial or say incomplete. For
example, we have "select a, count(*) from t group by a", we should return
rows that have "a" as the first expression value, and "count(*)" as the
second expression value. For this "count" expression, it actually needs a
ClientAggregator for evaluation, and that's what this
GroupedAggregatingResultIterator is used for.
Since "each server-side scan will produce results with a single tuple per
group by key", and PhoenixRecordReader is only dealing with one server-side
result each time, we don't care how the group-by keys are arranged (ordered
or unordered"). Actually GroupedAggregatingResultIterator is not the
group-by iterator we use for AggregatePlan. It does not "combine". It
treats every row as a different group, by returning its rowkey as the
group-by key (GroupedAggregatingResultIterator.java:56).

In short, this iterator is for decoding the server-side values. So we may
want to optimize this logic by removing this serialization and
deserialization and having only one set of aggregators in future.

bq. Also, not sure what impact it has in the way we "combine" the scans in
our Drill parallelization code (PhoenixGroupScan.applyAssignments()), as
each of our scans could include duplicate group by keys. Is it ok to
combine them in this case?

It should not matter, or at least is not related to the problem I'm now
having.
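
A toy illustration of why such duplicates would be harmless (plain Java,
nothing Phoenix- or Drill-specific): the final aggregation phase merges
partial counts by key, so the same group-by key arriving from two different
scans simply has its partial counts summed.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombineDemo {
  public static void main(String[] args) {
    // Partial (key, count) pairs as two scans might emit them; key 1
    // appearing twice models a group split across two regions.
    List<int[]> partials = List.of(
        new int[] {1, 40},    // from the scan over region 1
        new int[] {1, 60},    // from the scan over region 2 -- duplicate key
        new int[] {2, 100});  // a key that lived entirely in one region
    Map<Integer, Integer> finalAgg = new HashMap<>();
    for (int[] p : partials) {
      finalAgg.merge(p[0], p[1], Integer::sum);  // final phase sums partials
    }
    System.out.println(finalAgg);  // prints {1=100, 2=100}
  }
}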

bq. One more question: how is the group by key communicated back to Drill?

According to the HashAggPrule, if it decides to create a two-phase
aggregate, the first phase is now handled by Phoenix (after applying the
PhoenixHashAggPrule). I assume the partial results then get shuffled based
on the hash of their group-by keys (returned by PhoenixRecordReader). The
final step is the Drill hash aggregation.


This is my test table "A.BEER", which has four columns: "B", "E1", "E2",
"R", all of INTEGER types. And the data is generated like this:
for (x=1 to N) { //currently N=1000
 UPSERT INTO A.BEER VALUES (x, x % 10, x % 100, x);
}
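
For anyone reproducing this, the loop above translates to straightforward
Phoenix JDBC. A minimal sketch, assuming a local Phoenix/HBase instance and
a pre-created A.BEER table (the connection URL is an example only):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BeerDataGen {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:localhost");
         PreparedStatement ps = conn.prepareStatement(
             "UPSERT INTO A.BEER VALUES (?, ?, ?, ?)")) {
      final int n = 1000;
      for (int x = 1; x <= n; x++) {
        ps.setInt(1, x);        // B: the unique row key
        ps.setInt(2, x % 10);   // E1: 10 distinct group-by values
        ps.setInt(3, x % 100);  // E2: 100 distinct group-by values
        ps.setInt(4, x);        // R
        ps.executeUpdate();
      }
      conn.commit();  // Phoenix buffers upserts until commit
    }
  }
}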

The group-by query for testing is "SELECT e1, count(*) FROM a.beer GROUP BY
e1".
The expected result should be:
0 100
1 100
2 100
3 100
4 100
5 100
6 100
7 100
8 100
9 100
The actual result was:
6 0
7 0
8 0
9 0
0 0
1 100
2 100
3 100
4 100
5 100

Here I just tried another one "SELECT e2, count(*) FROM a.beer GROUP BY e2".
Similarly, the expected result should have group-by keys from 0 to 99, each
with a count of 10. In the actual result, keys 86 through 99, together with
0, all had a count of 0; the rest (1 to 85) all had the correct value 10.

Looks to me that the scans were good but there was a problem with one of
the hash buckets.
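
One way to inspect that shuffle (and, per the earlier question, whether more
than one slice is actually created) is to lower Drill's slice target and read
the plan. A minimal JDBC sketch; the "phoenix.a.beer" table reference assumes
the storage plugin is registered under the name "phoenix":

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainShuffle {
  public static void main(String[] args) throws Exception {
    // "zk=local" starts an embedded Drillbit; point at ZooKeeper for a cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement()) {
      // A slice target of 1 makes the planner parallelize even tiny inputs,
      // so the two-phase plan and its exchange actually show up.
      stmt.execute("ALTER SESSION SET `planner.slice_target` = 1");
      try (ResultSet rs = stmt.executeQuery(
          "EXPLAIN PLAN FOR SELECT e1, COUNT(*) FROM phoenix.a.beer GROUP BY e1")) {
        while (rs.next()) {
          // Look for a HashToRandomExchange between the two aggregation
          // phases, and for more than one major fragment (01-xx, 02-xx).
          System.out.println(rs.getString(1));
        }
      }
    }
  }
}

The query profile in the Drill web UI also lists the minor fragments per
major fragment, which answers the how-many-slices question directly.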

Thanks,
Maryann


On Tue, Oct 6, 2015 at 6:45 PM, James Taylor  wrote:

> Nice progress, Maryann.
>
> A few questions for you: not sure I understand the changes you made to
> PhoenixRecordReader. Is it necessary to wrap the server-side scan results
> in a GroupedAggregatingResultIterator? Each server-side scan will produce
> results with a single tuple per group by key. In Phoenix, the
> GroupedAggregatingResultIterator's function in life is to do the final
> merge. Note too that the results aren't sorted that come back from the
> aggregated scan (while GroupedAggregatingResultIterator needs tuples sorted
> by the group by key). Or is this just to help in decoding the values coming
> back from the scan?
>
> Also, not sure what impact it has in the way we "combine" the scans in our
> Drill parallelization code (PhoenixGroupScan.applyAssignments()), as each
> of our scans could include duplicate group by keys. Is it ok to combine
> them in this case?
>
> One more question: how is the group by key communicated back to Drill?
>
> Thanks,
> James
>
>
> On Tue, Oct 6, 2015 at 2:10 PM, Maryann Xue  wrote:
>
>> Added a few fixes in the pull request. Tested with two regions; it turned
>> out that half of the results were empty (count = 0).
>> Not sure if there's anything wrong with
>> https://github.com/maryannxue/drill/blob/phoenix_plugin/contrib/storage-phoenix/src/main/java/org/apache/drill/exec/store/phoenix/rel/PhoenixHashAggPrule.java
>> .
>> Like Julian said, this rule looks a bit hacky.
>>
>> 

Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread James Taylor
The results we get back from the server-side scan are already the partial
aggregated values we need. GroupedAggregatingResultIterator will collapse
adjacent Tuples together which happen to have the same row key. I'm not
sure we want/need this to happen. Instead I think we just need to decode
the aggregated values directly from the result of the scan.

On Tue, Oct 6, 2015 at 6:07 PM, Maryann Xue  wrote:

> Hi James,
>
> bq. A few questions for you: not sure I understand the changes you made to
> PhoenixRecordReader. Is it necessary to wrap the server-side scan results
> in a GroupedAggregatingResultIterator? Each server-side scan will produce
> results with a single tuple per group by key. In Phoenix, the
> GroupedAggregatingResultIterator's function in life is to do the final
> merge. Note too that the results aren't sorted that come back from the
> aggregated scan (while GroupedAggregatingResultIterator needs tuples sorted
> by the group by key). Or is this just to help in decoding the values coming
> back from the scan?
>
> It is necessary. I suppose what we should return as a partial result from
> PhoenixRecordReader is exactly the same as what we do in standalone
> Phoenix+Calcite, except that the result is partial or say incomplete. For
> example, we have "select a, count(*) from t group by a", we should return
> rows that have "a" as the first expression value, and "count(*)" as the
> second expression value. For this "count" expression, it actually needs a
> ClientAggregator for evaluation, and that's what this
> GroupedAggregatingResultIterator is used for.
> Since "each server-side scan will produce results with a single tuple per
> group by key", and PhoenixRecordReader is only dealing with one server-side
> result each time, we don't care how the group-by keys are arranged (ordered
> or unordered"). Actually GroupedAggregatingResultIterator is not the
> group-by iterator we use for AggregatePlan. It does not "combine". It
> treats every row as a different group, by returning its rowkey as the
> group-by key (GroupedAggregatingResultIterator.java:56).
>
> In short, this iterator is for decoding the server-side values. So we may
> want to optimize this logic by removing this serialization and
> deserialization and having only one set of aggregators in future.
>
> bq. Also, not sure what impact it has in the way we "combine" the scans in
> our Drill parallelization code (PhoenixGroupScan.applyAssignments()), as
> each of our scans could include duplicate group by keys. Is it ok to
> combine them in this case?
>
> It should not matter, or at least is not related to the problem I'm now
> having.
>
> bq. One more question: how is the group by key communicated back to Drill?
>
> According to the HashAggPrule, if it decides to create a two-phase
> aggregate, the first phase is now handled by Phoenix (after applying the
> PhoenixHashAggPrule). I assume the partial results then get shuffled based
> on the hash of their group-by keys (returned by PhoenixRecordReader). The
> final step is the Drill hash aggregation.
>
>
> This is my test table "A.BEER", which has four columns: "B", "E1", "E2",
> "R", all of INTEGER types. And the data is generated like this:
> for (x=1 to N) { //currently N=1000
>  UPSERT INTO A.BEER VALUES (x, x % 10, x % 100, x);
> }
>
> The group-by query for testing is "SELECT e1, count(*) FROM a.beer GROUP
> BY e1".
> The expected result should be:
> 0 100
> 1 100
> 2 100
> 3 100
> 4 100
> 5 100
> 6 100
> 7 100
> 8 100
> 9 100
> The actual result was:
> 6 0
> 7 0
> 8 0
> 9 0
> 0 0
> 1 100
> 2 100
> 3 100
> 4 100
> 5 100
>
> Here I just tried another one "SELECT e2, count(*) FROM a.beer GROUP BY
> e2".
> Similarly, the expected result should have group-by keys from 0 to 99, each
> with a count of 10. In the actual result, keys 86 through 99, together with
> 0, all had a count of 0; the rest (1 to 85) all had the correct value 10.
>
> Looks to me that the scans were good but there was a problem with one of
> the hash buckets.
>
> Thanks,
> Maryann
>
>
> On Tue, Oct 6, 2015 at 6:45 PM, James Taylor 
> wrote:
>
>> Nice progress, Maryann.
>>
>> A few questions for you: not sure I understand the changes you made to
>> PhoenixRecordReader. Is it necessary to wrap the server-side scan results
>> in a GroupedAggregatingResultIterator? Each server-side scan will produce
>> results with a single tuple per group by key. In Phoenix, the
>> GroupedAggregatingResultIterator's function in life is to do the final
>> merge. Note too that the results aren't sorted that come back from the
>> aggregated scan (while GroupedAggregatingResultIterator needs tuples sorted
>> by the group by key). Or is this just to help in decoding the values coming
>> back from the scan?
>>
>> Also, not sure what impact it has in the way we "combine" the scans in
>> our Drill parallelization code (PhoenixGroupScan.applyAssignments()), as
>> each of our scans could include duplicate group by keys. Is it ok to
>> combine them in this case?

Re: Partial aggregation in Drill-on-Phoenix

2015-10-06 Thread Maryann Xue
Yes, but the partially aggregated results will not contain any duplicate
rowkeys, since they are also the group-by keys. What we need is the
aggregators, calling aggregate for each row. We can write a new, simpler
ResultIterator to replace this, but for now it should work correctly.
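
A minimal sketch of that simpler iterator, written as a decorator over
org.apache.phoenix.iterate.ResultIterator. The actual decoding of the
aggregated values (via the plan's client aggregators) is deliberately left as
a comment, since that plumbing is the part under discussion:

import java.sql.SQLException;
import java.util.List;

import org.apache.phoenix.iterate.ResultIterator;
import org.apache.phoenix.schema.tuple.Tuple;

// Passes each server-side tuple through unchanged: every tuple is already
// one group's partial aggregate, so no collapsing by row key is needed.
public class PassThroughAggregatingIterator implements ResultIterator {
  private final ResultIterator delegate;

  public PassThroughAggregatingIterator(ResultIterator delegate) {
    this.delegate = delegate;
  }

  @Override
  public Tuple next() throws SQLException {
    Tuple tuple = delegate.next();
    if (tuple != null) {
      // Decode this tuple's aggregated values here, e.g. by feeding it to
      // the statement's client-side aggregators, instead of grouping
      // adjacent tuples the way GroupedAggregatingResultIterator does.
    }
    return tuple;
  }

  @Override
  public void close() throws SQLException {
    delegate.close();
  }

  @Override
  public void explain(List<String> planSteps) {
    delegate.explain(planSteps);
  }
}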

On Tue, Oct 6, 2015 at 9:45 PM, James Taylor  wrote:

> The results we get back from the server-side scan are already the partial
> aggregated values we need. GroupedAggregatingResultIterator will collapse
> adjacent Tuples together which happen to have the same row key. I'm not
> sure we want/need this to happen. Instead I think we just need to decode
> the aggregated values directly from the result of the scan.
>
> On Tue, Oct 6, 2015 at 6:07 PM, Maryann Xue  wrote:
>
>> Hi James,
>>
>> bq. A few questions for you: not sure I understand the changes you made
>> to PhoenixRecordReader. Is it necessary to wrap the server-side scan
>> results in a GroupedAggregatingResultIterator? Each server-side scan will
>> produce results with a single tuple per group by key. In Phoenix, the
>> GroupedAggregatingResultIterator's function in life is to do the final
>> merge. Note too that the results aren't sorted that come back from the
>> aggregated scan (while GroupedAggregatingResultIterator needs tuples sorted
>> by the group by key). Or is this just to help in decoding the values coming
>> back from the scan?
>>
>> It is necessary. I suppose what we should return as a partial result from
>> PhoenixRecordReader is exactly the same as what we do in standalone
>> Phoenix+Calcite, except that the result is partial or say incomplete. For
>> example, we have "select a, count(*) from t group by a", we should return
>> rows that have "a" as the first expression value, and "count(*)" as the
>> second expression value. For this "count" expression, it actually needs a
>> ClientAggregator for evaluation, and that's what this
>> GroupedAggregatingResultIterator is used for.
>> Since "each server-side scan will produce results with a single tuple
>> per group by key", and PhoenixRecordReader is only dealing with one
>> server-side result each time, we don't care how the group-by keys are
>> arranged (ordered or unordered"). Actually GroupedAggregatingResultIterator
>> is not the group-by iterator we use for AggregatePlan. It does not
>> "combine". It treats every row as a different group, by returning its
>> rowkey as the group-by key (GroupedAggregatingResultIterator.java:56).
>>
>> In short, this iterator is for decoding the server-side values. So we may
>> want to optimize this logic by removing this serialization and
>> deserialization and having only one set of aggregators in future.
>>
>> bq. Also, not sure what impact it has in the way we "combine" the scans
>> in our Drill parallelization code (PhoenixGroupScan.applyAssignments()),
>> as each of our scans could include duplicate group by keys. Is it ok to
>> combine them in this case?
>>
>> It should not matter, or at least is not related to the problem I'm now
>> having.
>>
>> bq. One more question: how is the group by key communicated back to Drill?
>>
>> According to the HashAggPrule, if it decides to create a two-phase
>> aggregate, the first phase is now handled by Phoenix (after applying the
>> PhoenixHashAggPrule). I assume the partial results then get shuffled based
>> on the hash of their group-by keys (returned by PhoenixRecordReader). The
>> final step is the Drill hash aggregation.
>>
>>
>> This is my test table "A.BEER", which has four columns: "B", "E1", "E2",
>> "R", all of INTEGER types. And the data is generated like this:
>> for (x=1 to N) { //currently N=1000
>>  UPSERT INTO A.BEER VALUES (x, x % 10, x % 100, x);
>> }
>>
>> The group-by query for testing is "SELECT e1, count(*) FROM a.beer GROUP
>> BY e1".
>> The expected result should be:
>> 0 100
>> 1 100
>> 2 100
>> 3 100
>> 4 100
>> 5 100
>> 6 100
>> 7 100
>> 8 100
>> 9 100
>> The actual result was:
>> 6 0
>> 7 0
>> 8 0
>> 9 0
>> 0 0
>> 1 100
>> 2 100
>> 3 100
>> 4 100
>> 5 100
>>
>> Here I just tried another one "SELECT e2, count(*) FROM a.beer GROUP BY
>> e2".
>> Similarly, the expected result should have group-by keys from 0 to 99,
>> each with a count of 10. In the actual result, keys 86 through 99,
>> together with 0, all had a count of 0; the rest (1 to 85) all had the
>> correct value 10.
>>
>> Looks to me that the scans were good but there was a problem with one of
>> the hash buckets.
>>
>> Thanks,
>> Maryann
>>
>>
>> On Tue, Oct 6, 2015 at 6:45 PM, James Taylor 
>> wrote:
>>
>>> Nice progress, Maryann.
>>>
>>> A few questions for you: not sure I understand the changes you made to
>>> PhoenixRecordReader. Is it necessary to wrap the server-side scan results
>>> in a GroupedAggregatingResultIterator? Each server-side scan will produce
>>> results with a single tuple per group by key. In Phoenix, the
>>> GroupedAggregatingResultIterator's function in life is to do the final
>>> merge.

[GitHub] drill pull request: DRILL-3888: Build test jars for all Drill Modu...

2015-10-06 Thread adityakishore
GitHub user adityakishore opened a pull request:

https://github.com/apache/drill/pull/188

DRILL-3888: Build test jars for all Drill Modules

This patch moves the test jar configuration to the root pom and removes it
from the individual modules' poms.
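
The shared configuration being hoisted presumably amounts to the stock
maven-jar-plugin test-jar execution, roughly as below (a sketch of the
standard Maven idiom, not necessarily the patch verbatim):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>

With this in the root pom's build plugins, every module produces a -tests
jar that downstream modules can depend on via <type>test-jar</type>.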

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adityakishore/drill DRILL-3888

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #188


commit 57f2bacfb0a7539ed2ec8ae3740a9e82e524141c
Author: Aditya Kishore 
Date:   2015-10-02T18:36:45Z

DRILL-3888: Build test jars for all Drill Modules




---