Thanks for the response. I don't think Phoenix does rollups, or at least
not in the version we are using. We are constrained by EMR on AWS, which
is on 4.14.

Thomas - what I meant was that I want to write an MR job or an "update
value" query for all of the sids, and there are about 200k different
values. It would be very easy if I could write an MR job to select all
rows in a day and reduce them to one row in a different table based on the
3 primary key columns. But because day is the last part of the primary
key, I keep getting errors, or the MR job just takes a long time.
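
Roughly, what I want to run for each batch of sids is something like this
(the segment_id range below is just a made-up example of one batch):

UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,
       CAST(SUM(reach) AS BIGINT) AS reach
FROM PUBLISHER_V4
WHERE segment_id >= 0 AND segment_id < 10000 -- one batch of sids
  AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
              AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id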

One thing I forgot to mention is that we do not need hourly granularity of
the data at query time. It is purely for "real-time" ingestion and getting
the latest data into the system.
Hence, at this point I am thinking I will generate daily data in a separate
offline pipeline and just keep updating the same row during ingestion.
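
If our Phoenix version supports it, I believe the atomic upsert syntax
(ON DUPLICATE KEY UPDATE) could also let ingestion accumulate into that
daily row directly; a rough, untested sketch with made-up values:

UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
VALUES (34535, 1234, TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT'), 100)
ON DUPLICATE KEY UPDATE reach = reach + 100 -- add this batch's count to the daily total
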
On Tue, Nov 27, 2018 at 12:19 AM Jonathan Leech <[email protected]> wrote:

> I would try writing the hourly values as 24 columns in a daily row, or as
> an array type.
>
> I'm not up to speed on the latest Phoenix features, but if it could
> update a daily sum on the fly, that might be OK. If that doesn't exist yet
> or isn't performant, it could be done in an HBase coprocessor.
>
> I would also put the daily sum column in a separate column family in the
> same table, rather than a separate table.
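>
> Roughly something like this (untested; the table and column names below
> are made up):
>
> CREATE TABLE IF NOT EXISTS DAILY_T1 (
>     sid BIGINT NOT NULL,
>     day DATE NOT NULL,
>     cid BIGINT NOT NULL,
>     h.hourly_counts INTEGER ARRAY[24], -- the 24 hourly values kept in the daily row
>     d.daily_count BIGINT,              -- daily sum in its own column family
>     CONSTRAINT PK PRIMARY KEY (sid, day, cid)
> ) COMPRESSION='SNAPPY'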
>
> - Jonathan
>
> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva <[email protected]>
> wrote:
>
> The 1-day aggregate query you are running should work; you might have to
> increase the client thread pool and queue size.
> See
> http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%[email protected]%3E
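>
> For example, in the hbase-site.xml on whichever client runs the queries
> (these settings default to 128 and 5000; the values below are only an
> example, tune them to your workload):
>
> <property>
>   <name>phoenix.query.threadPoolSize</name>
>   <value>256</value>
> </property>
> <property>
>   <name>phoenix.query.queueSize</name>
>   <value>10000</value>
> </property>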
>
> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <[email protected]> wrote:
>
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range is
>> anywhere from the current day to 2 years back.
>> I was looking into UPSERT SELECT, but there does not seem to be an easy
>> way to run it. For instance, I thought running something like the
>> statement below would be optimal:
>>
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
>> SELECT segment_id, cross_segment_id, TO_DATE('2018-07-01', 'yyyy-MM-dd',
>> 'GMT') AS day, CAST(SUM(reach) AS BIGINT) AS reach FROM PUBLISHER_V4
>> WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND
>> TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
>> cross_segment_id
>>
>> Hence I tried running just the SELECT statement to see if it works, and
>> started to see the error below:
>> Error: Task
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
>> rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
>> size = 128, active threads = 128, queued tasks = 5000, completed tasks =
>> 5153] (state=08000,code=101)
>>
>> When I changed the SELECT query to include a particular sid, the UPSERT
>> SELECT worked.
>> Hence I think the only way would be for me to run the UPSERT to generate
>> daily data for a range of sids (segment_ids) at a time.
>>
>> Did I miss something?
>>
>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <[email protected]>
>> wrote:
>>
>>> Since your PK already leads with (sid, day) I don't think adding a
>>> secondary index will help. Do you generally always run the aggregation
>>> query for the recently inserted data? The row timestamp feature might help
>>> in this case
>>> https://phoenix.apache.org/rowtimestamp.html
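>>> For example (a rough sketch; the T1_RT name is just for illustration,
>>> and ROW_TIMESTAMP can only be declared at table creation time):
>>>
>>> CREATE TABLE IF NOT EXISTS T1_RT (
>>>     sid BIGINT NOT NULL,
>>>     day DATE NOT NULL,
>>>     cid BIGINT NOT NULL,
>>>     s.count INTEGER,
>>>     CONSTRAINT PK PRIMARY KEY (sid, day ROW_TIMESTAMP, cid)
>>> ) COMPRESSION='SNAPPY'
>>>
>>> This maps the day column to the HBase cell timestamp, which can make
>>> scans over recently written data cheaper.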
>>> If you run the same aggregate queries multiple times then another
>>> approach is to store the coarser daily aggregated data in a separate table
>>> that you can populate using an UPSERT SELECT.
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>>> since you don't have an order by.
>>>
>>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <[email protected]> wrote:
>>>
>>>> Here it is
>>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>>> CLIENT MERGE SORT
>>>>
>>>> Note: we have a dedicated Phoenix Query Server.
>>>>
>>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <[email protected]>
>>>> wrote:
>>>>
>>>>> Monil,
>>>>>
>>>>> Could you please post the results of an EXPLAIN plan of your query?
>>>>> For directions how to do this please see
>>>>> http://phoenix.apache.org/explainplan.html
>>>>>
>>>>> Geoffrey Jacoby
>>>>>
>>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> Currently we have hourly data in our Phoenix table. However, the
>>>>>> schema was designed to perform well for daily data. Increasing the
>>>>>> number of rows by 24x has led to degradation of our service over time.
>>>>>> Our current schema is as follows
>>>>>>
>>>>>> CREATE TABLE IF NOT EXISTS T1 (
>>>>>>     sid BIGINT NOT NULL,
>>>>>>     day DATE NOT NULL,
>>>>>>     cid BIGINT NOT NULL,
>>>>>>     s.count INTEGER,
>>>>>>     CONSTRAINT PK PRIMARY KEY (sid, day, cid)
>>>>>> ) COMPRESSION='SNAPPY'
>>>>>>
>>>>>> The query we run is something along the lines of
>>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM
>>>>>> PUBLISHER_V4 WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01',
>>>>>> 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP
>>>>>> BY sid, cid
>>>>>>
>>>>>> Based on our investigation, we have concluded that the main cause is
>>>>>> purely the number of rows being read. I am open to other suggestions.
>>>>>>
>>>>>> If the number of rows is indeed the cause, I am wondering if there is
>>>>>> a way to either
>>>>>> 1. roll hourly data up to daily using views, a secondary index, or
>>>>>> MapReduce (I know MapReduce is possible), or
>>>>>> 2. migrate to a newer schema where cid is not part of the PK and is
>>>>>> actually a column family. I was unable to find any documentation on
>>>>>> this.
>>>>>>
>>>>>> Thanks
>>>>>> Monil
>>>>>>
>>>>>
