Re: Two query questions on time bin function

Mike Carey Sun, 21 Feb 2016 14:56:18 -0800

ARGH!!!!  This is what we would like you to *not* have to do.  Sorry...
Our aim is to be the Big Data antidote....


@Yingyi:  Maybe you could take a quick peek at the query issue and
see if there is any low-hanging hope there?

@Yiran:  How big are your windows, typically?  (Number of data points.)

Cheers,
Mike


On 2/21/16 2:42 PM, Yiran Wang wrote:

Thank you Mike for your update and suggestions! And thank you Ianagain for working with me.


A little update from my end:

I have been working on query (1) over the weekend for a work-around. Itried to simultaneously calculate the stdev on the new copy of thelist of values with outliers removed, which in nature is the same asthe query (2). So I ran into the same problem that the query did notcompile.

What I did was to export the entire dataset with the outliers removedinto Excel and calculate the stdev in Excel. However, the entiredataset is now 363,466 x 200+ in dimension. Though they do not exceedthe row x col limit in Excel, the memory of my computer is not bigenough to do anything useful without crashing. So I've been breakingthe dataset into smaller parts and working on each separately.


Yiran

On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <[email protected]<mailto:[email protected]>> wrote:


    Ian,

    Thanks working with Yiran on this!  I think there is "good" and
    bad news w.r.t these queries:

     - The bad news is that they go beyond what we are likely to
    optimize at all well at present,
        as they go beyond what typical DB aggregate functions like
    min/max/avg/count/sum do.
        (I would try forming the groups and then doing these things on
    the groups, but saying
        them in AQL will be tricky, and may lead to queries that hit
    edge cases in the optimizer.
        For some of these my thought was to try using a positional
    variable within a group...?)

     - The "good" news (only for AsterixDB) is that this is exactly
    the sort of inspiration that we
        are looking for in terms of understanding how to better for
    query-based analytics in real
        use cases (and this is a very real one!).

    To quote a short paper I reviewed just this AM on SQL queries kind
    of like these:  "Percentage
    queries are more complex than their conventional counterparts and
    introduce new challenges
    for optimization."  (The paper didn't have an applicable solution
    for us, sadly.)

    A more general facility that I wish we could offer was to do
    grouping in AsterixDB but then
    have the ability to pass a group to (e.g.) R and then get results
    back for the group.  When
    groups are small-ish (like Yiran's windows) that would be pretty
    cool - then one could do
    all sorts of advanced things per group.

    Cheers,
    Mike

    On 2/21/16 12:35 AM, Ian Maxon wrote:

    Yiran and I came up with possible answers for these...
    For 1) , a function could be used that looks something like this:

    declare function minmax($x){
    let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
    for $y in $x
    where $y < (2*$stdv) + avg($x)
    and $y > avg($x) - (2*$stdv)
    return $y
    }



    And then applied to return a new copy of the list of values, removing ones
    that are outside of 2 stdev.

    For 2), we also did come up with a potential solution ,but the query fails
    to compile (Filed ashttps://issues.apache.org/jira/browse/ASTERIXDB-1308  )


    Any thoughts on these queries would be welcome :) 1) especially seems
    inefficient to do as a function.

    - Ian

    On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<[email protected]> 
<mailto:[email protected]>  wrote:

    Hi Asterix team,

    I have two queries I'm struggling with. I'm hoping you could provide a
    direction for me. Thanks in advance!

    Here is what the data structure looks like:

    create type HRMType as closed {

       row_id: int32,

       sid: int32,

       date: date,

       day: int32,

       time: time,

       bpm: int32,

       RR: float

    };

    create dataset HRM (HRMType)

    primary key row_id;


    Previously I have used the time bin function to calculate the standard
    deviation of bpm for each time bin:

    for $i in dataset HRM

    group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
    interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i

    return {

    "sid": $sid,

    "gdate": $gdate,

    "gday": $gday,

    "timebin": $timebin,

    "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
    return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};

    Now I have two things I am hoping to do but need help with:

    1. For each 1-min time bin, remove the bpm values that are above the top
    5% or below the bottom 5%. I thought about using the min/max function for a
    few times to achieve this, but realized that it was not a good idea because
    in each time bin, the number of instances are not always the same. So for
    each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
    remove instances accordingly, which I don't know how to do.

    2. After removing the outliers of bpm for each 1-min time bin, calculate a
    median absolute deviation (MAD) for each 1-min time bin (as another measure
    of variation besides the standard deviation). MAD =
    median(abs(x-median(x)). I'm not sure how to write a query to do the median
    function in Asterix.

    Thank you so much in advance. Let me know if my questions are clear.

    Yiran

    --
    Best,
    Yiran

    --
    You received this message because you are subscribed to the Google Groups
    "asterixdb-dev" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email [email protected]
    <mailto:[email protected]>.
    For more options, visithttps://groups.google.com/d/optout.





--
Best,
Yiran

Re: Two query questions on time bin function

Reply via email to