Any JIRA yet, Marek? It'd be good to get to the bottom of this. Your
workaround is not going to perform nearly as well as using TRUNC on
the date.
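
For reference, the two forms in question are below (both copied from your
thread, so this is just a sketch of the comparison, not a tested fix). The
TRUNC variant keeps the grouping on the native TIME value, whereas the TO_CHAR
workaround formats every row into a VARCHAR before grouping:

  -- grouping on the truncated TIME value (what should work once the bug is tracked down)
  select trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id, sum(hu_usage)
  from se_dwh.homes_usage_hour
  where hu_ho_id in (4720,6854)
  group by trunc(hu_ts,'DAY',1), hu_ho_id, hu_stream_id;

  -- the TO_CHAR workaround that currently returns the expected rows
  select to_char(hu_ts,'yyyy-MM-dd'), hu_ho_id, hu_stream_id, sum(hu_usage)
  from se_dwh.homes_usage_hour
  where hu_ho_id in (4720,6854)
  group by to_char(hu_ts,'yyyy-MM-dd'), hu_ho_id, hu_stream_id;
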
Thanks,
James

On Tue, Apr 7, 2015 at 8:53 AM, James Taylor <[email protected]> wrote:
> Yes, please open a JIRA and attach that CSV (or ideally the smallest subset
> that exhibits the problem).
> Thanks,
> James
>
>
> On Tuesday, April 7, 2015, Marek Wiewiorka <[email protected]>
> wrote:
>>
>> Hi James - shall I still open a JIRA for that?
>> Thanks!
>> Marek
>>
>> 2015-04-06 22:48 GMT+02:00 Marek Wiewiorka <[email protected]>:
>>>
>>> Using psql from a CSV file:
>>> ./psql.py dwh:2181:/hbase-unsecure -t SE_DWH.HOMES_USAGE_HOUR
>>> /mnt/spark/export/usage_convert.txt/usage_merged.csv
>>>
>>> Here is a sample:
>>> 2015-03-19 23:59:59,6854,0,2.167
>>> 2015-03-19 22:59:59,6854,0,2.421
>>> 2015-03-19 21:59:59,6854,0,2.738
>>> 2015-03-19 20:59:59,6854,0,0.9490000000000001
>>> 2015-03-19 19:59:59,6854,0,0.748
>>> 2015-03-19 18:59:59,6854,0,2.76
>>> 2015-03-19 17:59:59,6854,0,1.801
>>> 2015-03-19 16:59:59,6854,0,0.661
>>> 2015-03-19 15:59:59,6854,0,1.082
>>> 2015-03-19 14:59:59,6854,0,1.303
>>>
>>>
>>> M.
>>>
>>> 2015-04-06 22:38 GMT+02:00 James Taylor <[email protected]>:
>>>>
>>>> How did you input the data?
>>>>
>>>> On Mon, Apr 6, 2015 at 1:27 PM, Marek Wiewiorka
>>>> <[email protected]> wrote:
>>>> > Oh, I'm sorry, I forgot to attach them:
>>>> > DDL of my table:
>>>> > create table se_dwh.homes_usage_hour (hu_ts time not null,
>>>> > hu_ho_id integer not null, hu_stream_id integer not null,
>>>> > hu_usage double constraint pk PRIMARY KEY (hu_ts, hu_ho_id, hu_stream_id));
>>>> >
>>>> > Phoenix: 4.3.0
>>>> >
>>>> > Thanks,
>>>> > Marek
>>>> >
>>>> >
>>>> > 2015-04-06 22:25 GMT+02:00 James Taylor <[email protected]>:
>>>> >>
>>>> >> Hi Marek,
>>>> >> How did you input the data and what does your CREATE TABLE/VIEW
>>>> >> statement look like? What version of Phoenix and HBase are you using?
>>>> >> Thanks,
>>>> >> James
>>>> >>
>>>> >> On Monday, April 6, 2015, Marek Wiewiorka <[email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi James - sure here is the result of your query (limited to 5
>>>> >>> rows):
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> +--------------------------+-----------+---------------+----------------------+
>>>> >>> |      TO_CHAR(HU_TS)      | HU_HO_ID  | HU_STREAM_ID  |       HU_USAGE       |
>>>> >>> +--------------------------+-----------+---------------+----------------------+
>>>> >>> | 2015-03-19 00:59:59.000  | 4720      | 0             | 0.287                |
>>>> >>> | 2015-03-19 00:59:59.000  | 6854      | 0             | 3.6189999999999998   |
>>>> >>> | 2015-03-19 01:59:59.000  | 4720      | 0             | 0.323                |
>>>> >>> | 2015-03-19 01:59:59.000  | 6854      | 0             | 2.556                |
>>>> >>> | 2015-03-19 02:59:59.000  | 4720      | 0             | 0.37                 |
>>>> >>> +--------------------------+-----------+---------------+----------------------+
>>>> >>>
>>>> >>> Ok - I will try to prepare a sample csv with DDL/DML.
>>>> >>>
>>>> >>> Thanks again,
>>>> >>> Marek
>>>> >>>
>>>> >>>
>>>> >>> 2015-04-06 22:06 GMT+02:00 James Taylor <[email protected]>:
>>>> >>>>
>>>> >>>> Hi Marek,
>>>> >>>> Thanks for the additional information. If you could answer my earlier
>>>> >>>> questions, that would be helpful.
>>>> >>>>
>>>> >>>> If you can't repro with a simple test case, then how about attaching a
>>>> >>>> csv dump of some of your data (the smallest amount of data that repros
>>>> >>>> the issue) to a JIRA along with the CREATE TABLE statement and the query?
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> James
>>>> >>>>
>>>> >>>> On Mon, Apr 6, 2015 at 12:49 PM, Marek Wiewiorka
>>>> >>>> <[email protected]> wrote:
>>>> >>>> > Hi Guys - thanks for your messages.
>>>> >>>> >
>>>> >>>> > I did another round of testing and found that if I use to_char instead
>>>> >>>> > of trunc, I get the expected result:
>>>> >>>> >
>>>> >>>> > select to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage
>>>> >>>> > where id in (4720,6854) group by to_char(dt,'yyyy-MM-dd'),stream,id;
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +-------------+------+---------+----------------------+
>>>> >>>> > | TO_CHAR(DT) | ID   | STREAM  |      SUM(USAGE)      |
>>>> >>>> > +-------------+------+---------+----------------------+
>>>> >>>> > | 2015-03-19  | 4720 | 0       | 8.405999999999999    |
>>>> >>>> > | 2015-03-19  | 6854 | 0       | 28.339000000000006   |
>>>> >>>> > +-------------+------+---------+----------------------+
>>>> >>>> >
>>>> >>>> > The same query with trunc returns 48 rows (24 rows per id):
>>>> >>>> > select trunc(dt,'DAY') as day,id,stream,sum(usage) from usage
>>>> >>>> > where id in (4720,6854) group by trunc(dt,'DAY'),stream,id;
>>>> >>>> > e.g.:
>>>> >>>> > ....
>>>> >>>> > | 2015-03-19 00:00:00.000  | 4720      | 0             | 0.406                |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 6854      | 0             | 0.9490000000000001   |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 4720      | 0             | 0.332                |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 6854      | 0             | 2.738                |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 4720      | 0             | 0.33499999999999996  |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 6854      | 0             | 2.421                |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 4720      | 0             | 0.45599999999999996  |
>>>> >>>> > | 2015-03-19 00:00:00.000  | 6854      | 0             | 2.167                |
>>>> >>>> > +--------------------------+-----------+---------------+----------------------+
>>>> >>>> >
>>>> >>>> > So for some reason grouping by trunc is not working, but grouping by
>>>> >>>> > to_char is. This applies to the table loaded with psql from a CSV file.
>>>> >>>> >
>>>> >>>> > When I tried to create a sample table and populate it with upserts,
>>>> >>>> > everything worked as expected in both cases:
>>>> >>>> >
>>>> >>>> > create table usage (dt time not null, id integer not null,
>>>> >>>> > stream integer not null, usage double
>>>> >>>> > constraint pk PRIMARY KEY (dt, id, stream));
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 10:00:00',100,0,2.0);
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 11:00:00',100,0,3.0);
>>>> >>>> >
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 12:00:00',200,0,4.0);
>>>> >>>> > UPSERT INTO usage VALUES('2015-04-01 13:00:00',200,0,6.0);
>>>> >>>> >
>>>> >>>> > 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select trunc(dt,'DAY') as day,
>>>> >>>> > id,stream,sum(usage) from usage where id in (100,200)
>>>> >>>> > group by trunc(dt,'DAY'),stream,id;
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +--------------------------+------+---------+-------------+
>>>> >>>> > |           DAY            | ID   | STREAM  | SUM(USAGE)  |
>>>> >>>> > +--------------------------+------+---------+-------------+
>>>> >>>> > | 2015-04-01 00:00:00.000  | 100  | 0       | 5.0         |
>>>> >>>> > | 2015-04-01 00:00:00.000  | 200  | 0       | 10.0        |
>>>> >>>> > +--------------------------+------+---------+-------------+
>>>> >>>> >
>>>> >>>> > 0: jdbc:phoenix:dwh.sayene.com:2181:/hbase-un> select
>>>> >>>> > to_char(dt,'yyyy-MM-dd'),id,stream,sum(usage) from usage where id in
>>>> >>>> > (100,200) group by to_char(dt,'yyyy-MM-dd'),stream,id;
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +--------------+------+---------+-------------+
>>>> >>>> > | TO_CHAR(DT)  | ID   | STREAM  | SUM(USAGE)  |
>>>> >>>> > +--------------+------+---------+-------------+
>>>> >>>> > | 2015-04-01   | 100  | 0       | 5.0         |
>>>> >>>> > | 2015-04-01   | 200  | 0       | 10.0        |
>>>> >>>> > +--------------+------+---------+-------------+
>>>> >>>> > 2 rows selected (1.49 seconds)
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > Shall I open a JIRA for that?
>>>> >>>> >
>>>> >>>> > Regards,
>>>> >>>> > Marek
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > 2015-04-06 20:16 GMT+02:00 James Taylor <[email protected]>:
>>>> >>>> >>
>>>> >>>> >> Hi Marek,
>>>> >>>> >> How did you input the data and what does your CREATE TABLE/VIEW
>>>> >>>> >> statement look like? What version of Phoenix and HBase are you
>>>> >>>> >> using?
>>>> >>>> >>
>>>> >>>> >> Also, would you mind running the following query and letting us know
>>>> >>>> >> the output?
>>>> >>>> >>
>>>> >>>> >> select to_char(hu_ts,'yyyy-MM-dd HH:mm:ss.SSS'),hu_ho_id,hu_stream_id,hu_usage
>>>> >>>> >> from se_dwh.homes_usage_hour
>>>> >>>> >> where hu_ho_id in (4720,6854);
>>>> >>>> >>
>>>> >>>> >> Thanks,
>>>> >>>> >> James
>>>> >>>> >>
>>>> >>>> >> On Mon, Apr 6, 2015 at 10:34 AM, Gabriel Reid
>>>> >>>> >> <[email protected]>
>>>> >>>> >> wrote:
>>>> >>>> >> > That certainly looks like a bug. Would it be possible to make a small
>>>> >>>> >> > reproducible test case and, if possible, log this in the Phoenix JIRA
>>>> >>>> >> > (https://issues.apache.org/jira/browse/PHOENIX)?
>>>> >>>> >> >
>>>> >>>> >> > - Gabriel
>>>> >>>> >> >
>>>> >>>> >> > On Mon, Apr 6, 2015 at 6:10 PM Marek Wiewiorka
>>>> >>>> >> > <[email protected]>
>>>> >>>> >> > wrote:
>>>> >>>> >> >>
>>>> >>>> >> >> Hi All,
>>>> >>>> >> >> I came across a weird situation while running a query with group by.
>>>> >>>> >> >> I executed 2 queries:
>>>> >>>> >> >>
>>>> >>>> >> >> 1)
>>>> >>>> >> >> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>>>> >>>> >> >> from se_dwh.homes_usage_hour where hu_ho_id in (4720,6854)
>>>> >>>> >> >> group by trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;
>>>> >>>> >> >>
>>>> >>>> >> >> | 2015-03-19 00:00:00.000 | 4720 | 0 | 0.45599999999999996 |
>>>> >>>> >> >> | 2015-03-19 00:00:00.000 | 6854 | 0 | 2.167               |
>>>> >>>> >> >>
>>>> >>>> >> >>
>>>> >>>> >> >> 2)
>>>> >>>> >> >>
>>>> >>>> >> >> select trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id,sum(hu_usage)
>>>> >>>> >> >> from se_dwh.homes_usage_hour where hu_ho_id in (4720)
>>>> >>>> >> >> group by trunc(hu_ts,'DAY',1),hu_ho_id,hu_stream_id;
>>>> >>>> >> >>
>>>> >>>> >> >> | 2015-03-19 00:00:00.000 | 4720 | 0 | 8.405999999999999   |
>>>> >>>> >> >>
>>>> >>>> >> >>
>>>> >>>> >> >> The only difference is that in the first case I included two ids
>>>> >>>> >> >> (4720,6854) and in the other only one (4720).
>>>> >>>> >> >> The result for hu_ho_id=4720 should be the same in both cases, but it
>>>> >>>> >> >> isn't (0.45 vs 8.4). The second result (8.4) is correct.
>>>> >>>> >> >>
>>>> >>>> >> >> Am I doing something wrong?
>>>> >>>> >> >>
>>>> >>>> >> >> Regards,
>>>> >>>> >> >> Marek
>>>> >>>> >> >>
>>>> >>>> >> >
>>>> >>>> >
>>>> >>>> >
>>>> >>>
>>>> >>>
>>>> >
>>>
>>>
>>
>
