Re: Two query questions on time bin function

Ian Maxon Mon, 22 Feb 2016 14:04:26 -0800

Regarding the latest query about computing some summary statistics from the
min-maxed data, Yiran and I just finished meeting and we were able to come
up with a work around.


The query was as follows:
use dataverse mt16

for $i in dataset HRM_binned_clipped return {
"row_id": $i.row_id,
"sid": $i.sid,
"gdate": $i.gdate,
"gday": $i.gday,
"timebin": $i.timebin,
"stdv_RR_clipped": avg($i.RR_clipped)
}

However, it would fail like this:

Type of argument in function-call: asterix:avg, Args:[function-call:
asterix:field-access-by-name, Args:[%0->$$0, AString: {RR_clipped}]] should
be a collection type instead of ANY [AlgebricksException]


Because HRM_binned_clipped was open and there is some sort of bug in the
avg() function for this. count() works fine.
The work around, is to just copy everything into a closed dataset, where
RR_clipped is of type [double]. The query then works.


Again though this is kind of a work-around, to a work around. The original
query:


declare function minmax($x){
let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
for $y in $x
where $y < (2*$stdv) + avg($x)
and $y > avg($x) - (2*$stdv)
return $y
}

for $i in dataset HRM
group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
return {
"sid": $sid,
"gdate": $gdate,
"gday": $gday,
"timebin": $timebin,
"stdv": (avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR *
$ii.RR) - avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR)
* avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR))^(0.5)
}

Expresses the same thing without any intermediate datasets. This query
fails in compilation (only with the avgs added). I need to get the stack
from the version that Yiran is running on though, I can't reproduce it on
master; it fails in a different way.


-Ian


On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <[email protected]> wrote:

> ARGH!!!!  This is what we would like you to *not* have to do.  Sorry...
> Our aim is to be the Big Data antidote....
>
> @Yingyi:  Maybe you could take a quick peek at the query issue and
> see if there is any low-hanging hope there?
>
> @Yiran:  How big are your windows, typically?  (Number of data points.)
>
> Cheers,
> Mike
>
>
> On 2/21/16 2:42 PM, Yiran Wang wrote:
>
>> Thank you Mike for your update and suggestions! And thank you Ian again
>> for working with me.
>>
>> A little update from my end:
>>
>> I have been working on query (1) over the weekend for a work-around. I
>> tried to simultaneously calculate the stdev on the new copy of the list of
>> values with outliers removed, which in nature is the same as the query (2).
>> So I ran into the same problem that the query did not compile.
>>
>> What I did was to export the entire dataset with the outliers removed
>> into Excel and calculate the stdev in Excel. However, the entire dataset is
>> now 363,466 x 200+ in dimension. Though they do not exceed the row x col
>> limit in Excel, the memory of my computer is not big enough to do anything
>> useful without crashing. So I've been breaking the dataset into smaller
>> parts and working on each separately.
>>
>> Yiran
>>
>>
>>
>>
>> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <[email protected] <mailto:
>> [email protected]>> wrote:
>>
>>     Ian,
>>
>>     Thanks working with Yiran on this!  I think there is "good" and
>>     bad news w.r.t these queries:
>>
>>      - The bad news is that they go beyond what we are likely to
>>     optimize at all well at present,
>>         as they go beyond what typical DB aggregate functions like
>>     min/max/avg/count/sum do.
>>         (I would try forming the groups and then doing these things on
>>     the groups, but saying
>>         them in AQL will be tricky, and may lead to queries that hit
>>     edge cases in the optimizer.
>>         For some of these my thought was to try using a positional
>>     variable within a group...?)
>>
>>      - The "good" news (only for AsterixDB) is that this is exactly
>>     the sort of inspiration that we
>>         are looking for in terms of understanding how to better for
>>     query-based analytics in real
>>         use cases (and this is a very real one!).
>>
>>     To quote a short paper I reviewed just this AM on SQL queries kind
>>     of like these:  "Percentage
>>     queries are more complex than their conventional counterparts and
>>     introduce new challenges
>>     for optimization."  (The paper didn't have an applicable solution
>>     for us, sadly.)
>>
>>     A more general facility that I wish we could offer was to do
>>     grouping in AsterixDB but then
>>     have the ability to pass a group to (e.g.) R and then get results
>>     back for the group.  When
>>     groups are small-ish (like Yiran's windows) that would be pretty
>>     cool - then one could do
>>     all sorts of advanced things per group.
>>
>>     Cheers,
>>     Mike
>>
>>     On 2/21/16 12:35 AM, Ian Maxon wrote:
>>
>>>     Yiran and I came up with possible answers for these...
>>>     For 1) , a function could be used that looks something like this:
>>>
>>>     declare function minmax($x){
>>>     let $stdv := (avg(for $z in $x return $z*$z) - avg($x) *
>>> avg($x))^(0.5)
>>>     for $y in $x
>>>     where $y < (2*$stdv) + avg($x)
>>>     and $y > avg($x) - (2*$stdv)
>>>     return $y
>>>     }
>>>
>>>
>>>
>>>     And then applied to return a new copy of the list of values,
>>> removing ones
>>>     that are outside of 2 stdev.
>>>
>>>     For 2), we also did come up with a potential solution ,but the query
>>> fails
>>>     to compile (Filed ashttps://
>>> issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>>
>>>
>>>     Any thoughts on these queries would be welcome :) 1) especially seems
>>>     inefficient to do as a function.
>>>
>>>     - Ian
>>>
>>>     On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<[email protected]>
>>> <mailto:[email protected]>  wrote:
>>>
>>>     Hi Asterix team,
>>>>
>>>>     I have two queries I'm struggling with. I'm hoping you could
>>>> provide a
>>>>     direction for me. Thanks in advance!
>>>>
>>>>     Here is what the data structure looks like:
>>>>
>>>>     create type HRMType as closed {
>>>>
>>>>        row_id: int32,
>>>>
>>>>        sid: int32,
>>>>
>>>>        date: date,
>>>>
>>>>        day: int32,
>>>>
>>>>        time: time,
>>>>
>>>>        bpm: int32,
>>>>
>>>>        RR: float
>>>>
>>>>     };
>>>>
>>>>     create dataset HRM (HRMType)
>>>>
>>>>     primary key row_id;
>>>>
>>>>
>>>>     Previously I have used the time bin function to calculate the
>>>> standard
>>>>     deviation of bpm for each time bin:
>>>>
>>>>     for $i in dataset HRM
>>>>
>>>>     group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day,
>>>> $timebin :=
>>>>     interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M"))
>>>> with $i
>>>>
>>>>     return {
>>>>
>>>>     "sid": $sid,
>>>>
>>>>     "gdate": $gdate,
>>>>
>>>>     "gday": $gday,
>>>>
>>>>     "timebin": $timebin,
>>>>
>>>>     "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in
>>>> $i
>>>>     return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>
>>>>     Now I have two things I am hoping to do but need help with:
>>>>
>>>>     1. For each 1-min time bin, remove the bpm values that are above
>>>> the top
>>>>     5% or below the bottom 5%. I thought about using the min/max
>>>> function for a
>>>>     few times to achieve this, but realized that it was not a good idea
>>>> because
>>>>     in each time bin, the number of instances are not always the same.
>>>> So for
>>>>     each 1-min time bin, we do need to calculate the 5% and 95%
>>>> threshold, and
>>>>     remove instances accordingly, which I don't know how to do.
>>>>
>>>>     2. After removing the outliers of bpm for each 1-min time bin,
>>>> calculate a
>>>>     median absolute deviation (MAD) for each 1-min time bin (as another
>>>> measure
>>>>     of variation besides the standard deviation). MAD =
>>>>     median(abs(x-median(x)). I'm not sure how to write a query to do
>>>> the median
>>>>     function in Asterix.
>>>>
>>>>     Thank you so much in advance. Let me know if my questions are clear.
>>>>
>>>>     Yiran
>>>>
>>>>     --
>>>>     Best,
>>>>     Yiran
>>>>
>>>>     --
>>>>     You received this message because you are subscribed to the Google
>>>> Groups
>>>>     "asterixdb-dev" group.
>>>>     To unsubscribe from this group and stop receiving emails from it,
>>>> send an
>>>>     email [email protected]
>>>>     <mailto:[email protected]>.
>>>>     For more options, visithttps://groups.google.com/d/optout.
>>>>
>>>>
>>
>>
>>
>> --
>> Best,
>> Yiran
>>
>
>

Re: Two query questions on time bin function

Reply via email to