Regarding the latest query about computing some summary statistics from the
min-maxed data, Yiran and I just finished meeting and we were able to come
up with a work around.
The query was as follows:
use dataverse mt16
for $i in dataset HRM_binned_clipped return {
"row_id": $i.row_id,
"sid": $i.sid,
"gdate": $i.gdate,
"gday": $i.gday,
"timebin": $i.timebin,
"stdv_RR_clipped": avg($i.RR_clipped)
}
However, it would fail like this:
Type of argument in function-call: asterix:avg, Args:[function-call:
asterix:field-access-by-name, Args:[%0->$$0, AString: {RR_clipped}]] should
be a collection type instead of ANY [AlgebricksException]
Because HRM_binned_clipped was open and there is some sort of bug in the
avg() function for this. count() works fine.
The work around, is to just copy everything into a closed dataset, where
RR_clipped is of type [double]. The query then works.
Again though this is kind of a work-around, to a work around. The original
query:
declare function minmax($x){
let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
for $y in $x
where $y < (2*$stdv) + avg($x)
and $y > avg($x) - (2*$stdv)
return $y
}
for $i in dataset HRM
group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
return {
"sid": $sid,
"gdate": $gdate,
"gday": $gday,
"timebin": $timebin,
"stdv": (avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR *
$ii.RR) - avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR)
* avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR))^(0.5)
}
Expresses the same thing without any intermediate datasets. This query
fails in compilation (only with the avgs added). I need to get the stack
from the version that Yiran is running on though, I can't reproduce it on
master; it fails in a different way.
-Ian
On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <[email protected]> wrote:
> ARGH!!!! This is what we would like you to *not* have to do. Sorry...
> Our aim is to be the Big Data antidote....
>
> @Yingyi: Maybe you could take a quick peek at the query issue and
> see if there is any low-hanging hope there?
>
> @Yiran: How big are your windows, typically? (Number of data points.)
>
> Cheers,
> Mike
>
>
> On 2/21/16 2:42 PM, Yiran Wang wrote:
>
>> Thank you Mike for your update and suggestions! And thank you Ian again
>> for working with me.
>>
>> A little update from my end:
>>
>> I have been working on query (1) over the weekend for a work-around. I
>> tried to simultaneously calculate the stdev on the new copy of the list of
>> values with outliers removed, which in nature is the same as the query (2).
>> So I ran into the same problem that the query did not compile.
>>
>> What I did was to export the entire dataset with the outliers removed
>> into Excel and calculate the stdev in Excel. However, the entire dataset is
>> now 363,466 x 200+ in dimension. Though they do not exceed the row x col
>> limit in Excel, the memory of my computer is not big enough to do anything
>> useful without crashing. So I've been breaking the dataset into smaller
>> parts and working on each separately.
>>
>> Yiran
>>
>>
>>
>>
>> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <[email protected] <mailto:
>> [email protected]>> wrote:
>>
>> Ian,
>>
>> Thanks working with Yiran on this! I think there is "good" and
>> bad news w.r.t these queries:
>>
>> - The bad news is that they go beyond what we are likely to
>> optimize at all well at present,
>> as they go beyond what typical DB aggregate functions like
>> min/max/avg/count/sum do.
>> (I would try forming the groups and then doing these things on
>> the groups, but saying
>> them in AQL will be tricky, and may lead to queries that hit
>> edge cases in the optimizer.
>> For some of these my thought was to try using a positional
>> variable within a group...?)
>>
>> - The "good" news (only for AsterixDB) is that this is exactly
>> the sort of inspiration that we
>> are looking for in terms of understanding how to better for
>> query-based analytics in real
>> use cases (and this is a very real one!).
>>
>> To quote a short paper I reviewed just this AM on SQL queries kind
>> of like these: "Percentage
>> queries are more complex than their conventional counterparts and
>> introduce new challenges
>> for optimization." (The paper didn't have an applicable solution
>> for us, sadly.)
>>
>> A more general facility that I wish we could offer was to do
>> grouping in AsterixDB but then
>> have the ability to pass a group to (e.g.) R and then get results
>> back for the group. When
>> groups are small-ish (like Yiran's windows) that would be pretty
>> cool - then one could do
>> all sorts of advanced things per group.
>>
>> Cheers,
>> Mike
>>
>> On 2/21/16 12:35 AM, Ian Maxon wrote:
>>
>>> Yiran and I came up with possible answers for these...
>>> For 1) , a function could be used that looks something like this:
>>>
>>> declare function minmax($x){
>>> let $stdv := (avg(for $z in $x return $z*$z) - avg($x) *
>>> avg($x))^(0.5)
>>> for $y in $x
>>> where $y < (2*$stdv) + avg($x)
>>> and $y > avg($x) - (2*$stdv)
>>> return $y
>>> }
>>>
>>>
>>>
>>> And then applied to return a new copy of the list of values,
>>> removing ones
>>> that are outside of 2 stdev.
>>>
>>> For 2), we also did come up with a potential solution ,but the query
>>> fails
>>> to compile (Filed ashttps://
>>> issues.apache.org/jira/browse/ASTERIXDB-1308 )
>>>
>>>
>>> Any thoughts on these queries would be welcome :) 1) especially seems
>>> inefficient to do as a function.
>>>
>>> - Ian
>>>
>>> On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<[email protected]>
>>> <mailto:[email protected]> wrote:
>>>
>>> Hi Asterix team,
>>>>
>>>> I have two queries I'm struggling with. I'm hoping you could
>>>> provide a
>>>> direction for me. Thanks in advance!
>>>>
>>>> Here is what the data structure looks like:
>>>>
>>>> create type HRMType as closed {
>>>>
>>>> row_id: int32,
>>>>
>>>> sid: int32,
>>>>
>>>> date: date,
>>>>
>>>> day: int32,
>>>>
>>>> time: time,
>>>>
>>>> bpm: int32,
>>>>
>>>> RR: float
>>>>
>>>> };
>>>>
>>>> create dataset HRM (HRMType)
>>>>
>>>> primary key row_id;
>>>>
>>>>
>>>> Previously I have used the time bin function to calculate the
>>>> standard
>>>> deviation of bpm for each time bin:
>>>>
>>>> for $i in dataset HRM
>>>>
>>>> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day,
>>>> $timebin :=
>>>> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M"))
>>>> with $i
>>>>
>>>> return {
>>>>
>>>> "sid": $sid,
>>>>
>>>> "gdate": $gdate,
>>>>
>>>> "gday": $gday,
>>>>
>>>> "timebin": $timebin,
>>>>
>>>> "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in
>>>> $i
>>>> return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>
>>>> Now I have two things I am hoping to do but need help with:
>>>>
>>>> 1. For each 1-min time bin, remove the bpm values that are above
>>>> the top
>>>> 5% or below the bottom 5%. I thought about using the min/max
>>>> function for a
>>>> few times to achieve this, but realized that it was not a good idea
>>>> because
>>>> in each time bin, the number of instances are not always the same.
>>>> So for
>>>> each 1-min time bin, we do need to calculate the 5% and 95%
>>>> threshold, and
>>>> remove instances accordingly, which I don't know how to do.
>>>>
>>>> 2. After removing the outliers of bpm for each 1-min time bin,
>>>> calculate a
>>>> median absolute deviation (MAD) for each 1-min time bin (as another
>>>> measure
>>>> of variation besides the standard deviation). MAD =
>>>> median(abs(x-median(x)). I'm not sure how to write a query to do
>>>> the median
>>>> function in Asterix.
>>>>
>>>> Thank you so much in advance. Let me know if my questions are clear.
>>>>
>>>> Yiran
>>>>
>>>> --
>>>> Best,
>>>> Yiran
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups
>>>> "asterixdb-dev" group.
>>>> To unsubscribe from this group and stop receiving emails from it,
>>>> send an
>>>> email [email protected]
>>>> <mailto:[email protected]>.
>>>> For more options, visithttps://groups.google.com/d/optout.
>>>>
>>>>
>>
>>
>>
>> --
>> Best,
>> Yiran
>>
>
>